This section loads the libraries required for the analysis and sets initial parameters such as the color mapping and the random seed. Fixing the seed guarantees that any random operations give identical results on every run, keeping the analysis reproducible.
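No setup chunk is shown here, so the following is a minimal sketch of what it might contain. The package list, the seed value, and the `fraction_colors` mapping (referenced by the plots further below) are assumptions to be adapted to your session:

```r
# Sketch of a setup chunk (package list and values are assumptions)
pkgs <- c("data.table", "dplyr", "tidyr", "readr", "ggplot2",
          "patchwork", "writexl", "ComplexUpset")
invisible(lapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    library(p, character.only = TRUE)
  }
}))
# Fix the seed so random operations are reproducible across runs
set.seed(42)
# Example color mapping for the buffer fractions used in the plots below
fraction_colors <- c(single = "#999999", AS = "#1b9e77",
                     AI = "#d95f02", Total = "#7570b3")
```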
Next, the raw data files are loaded into R: the main dataset and the metadata file, both of which are used throughout the subsequent analysis. The final output directory, which will store the results, is also defined here; if it does not exist, it is created.
# Adjust path as necessary for your file locations
dataset_unsorted <- fread("data/report.pg_matrix.tsv")[, ID := .I]
meta_data <- fread("data/SampleDescription.csv")
# Define the final output directory, or set NULL if results should not be saved
final_out_dir <- "results"
# Creates the folder if it doesn't exist
if(!is.null(final_out_dir)){dir.create(final_out_dir, recursive = TRUE, showWarnings = FALSE)}
The column names of the loaded dataset are cleaned up by removing a run-specific prefix and suffix. Standardized column names prevent confusion and errors in later processing steps.
# Clean up column names in dataset
# Remove specific prefix and suffix from column names
prefix <- "/mnt/ag_proteomik/shared_worktmp/20240126_CaGe_Knochenextraktion ohne Label_Thess/Knochenextrakte_"
suffix <- "_Slot.*"
names(dataset_unsorted) <- sub(suffix, "", sub(prefix, "", names(dataset_unsorted)))
# Perform the replacements
meta_data[Method == "2step+", Method := "2step+M"]
meta_data[Method == "1step+", Method := "1step+C"]
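As a toy illustration of the prefix/suffix cleanup above (file name invented, patterns simplified):

```r
# Invented raw column name run through the same nested sub() cleanup
x <- "/some/prefix/Knochenextrakte_M1-AS-1_Slot2-01"
# Inner sub() strips everything up to and including the prefix,
# outer sub() strips the "_Slot..." suffix, leaving the sample name
sub("_Slot.*", "", sub(".*Knochenextrakte_", "", x))
```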
Here the metadata is prepared for sorting, and the columns of the main dataset are reordered to match. Keeping the dataset columns aligned with the sorted metadata maintains data integrity and simplifies the analyses that follow.
# Prepare meta_data for sorting
# Extract numerical part from 'Column' for sorting purposes
meta_data[, M_value := as.numeric(sub("M", "", sub("-.*", "", Column)))]
# Define custom order for Buffer_type and apply it
buffer_order <- c("AS", "AI", "AS1", "AI1", "AI2", "AS2", "single")
meta_data[, Buffer_name := factor(Buffer_name, levels = buffer_order)]
# Sort meta_data based on multiple criteria
ordered_meta_data <- meta_data[order(M_value, Repl_Techn, Repl_Biol, Buffer_name)]
# Sort column names in dataset_unsorted
descr_columns <- names(dataset_unsorted)[!names(dataset_unsorted) %in% ordered_meta_data$Column]
new_order <- c(descr_columns, ordered_meta_data$Column)
dataset_sorted <- dataset_unsorted[, ..new_order]
This completes the setup and initial data preparation steps, ensuring that the data is properly loaded, cleaned, and organized for further analysis.
This normalization technique adjusts the values in each column based on the average intensity observed across similar types of samples (as grouped by Sample_Type). By scaling each column to match the average, the method attempts to correct for any discrepancies in sample loading or measurement that might have resulted in abnormally high or low readings. The goal is to bring all samples to a common scale, making it easier to compare results across different samples or conditions without the confounding effect of varying starting intensities.
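The arithmetic behind the scaling can be seen on a toy vector of per-sample intensity sums (numbers invented):

```r
# Per-sample total intensities for one Sample_Type (invented values)
sums <- c(M1 = 80, M2 = 100, M3 = 125)
# Factor = median of the group divided by each sample's own sum
norm_fact <- median(sums) / sums
# After scaling, every sample's total equals the group median (100)
sums * norm_fact
```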
First, we define a function to calculate the intensity sums. This function filters out rows that have any missing values (NAs) in the columns matching the pattern “^M[1-5]” and then calculates the sum of each of these columns. This helps ensure that only complete cases are considered in the sum calculations.
# Define the function to calculate intensity sums
calculate_intensity_sums <- function(df) {
  # Keep only rows that are complete across the intensity columns ("^M[1-5]")
  filtered_df <- df %>%
    filter(rowSums(is.na(select(., matches("^M[1-5]")))) == 0)
  # Report how many rows survive the filter
  message("Number of rows after filtering: ", nrow(filtered_df))
  # Sum each intensity column over the complete cases
  intensity_sums <- filtered_df %>%
    select(matches("^M[1-5]")) %>%
    colSums()
  return(intensity_sums)
}
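The filter-then-sum logic can be reproduced in base R on a toy table (values invented), which also shows why incomplete rows are dropped before summing:

```r
# Toy table with one incomplete row (values invented)
toy <- data.frame(Genes = c("g1", "g2", "g3"),
                  M1 = c(1, NA, 3), M2 = c(4, 5, 6))
# Keep only rows complete across the intensity columns
keep <- complete.cases(toy[c("M1", "M2")])
# Column sums over complete cases: g2 is excluded
colSums(toy[keep, c("M1", "M2")])
```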
Next, we define a function to calculate normalization factors. This function first calculates the column sums using the previously defined function. It then computes the normalization factors for each unique sample type by dividing the median sum of each sample type by the respective intensity sum.
calculate_norm_fact <- function(df, sampledescription) {
  # Column sums over complete cases
  intensity_sums <- calculate_intensity_sums(df)
  # Map each intensity column to its sample type
  col_sample_type <- sampledescription$Sample_Type[match(names(intensity_sums), sampledescription$Column)]
  # Normalization factor per sample: median sum of its sample type
  # divided by the sample's own sum. Index via col_sample_type so the
  # grouping lines up with intensity_sums rather than the metadata rows.
  norm_fact <- sapply(unique(col_sample_type), function(st) {
    median_sum <- median(intensity_sums[col_sample_type == st])
    median_sum / intensity_sums[col_sample_type == st]
  })
  # Flatten and strip the "<type>." prefix that sapply adds to the names
  norm_fact <- setNames(
    unlist(norm_fact),
    sub(".*\\.", "", names(unlist(norm_fact)))
  )
  return(norm_fact)
}
Finally, we define a function to apply the normalization factors to the dataframe. This function ensures that the normalization factors’ names directly match the dataframe’s column names and then multiplies each column by its corresponding normalization factor.
normalize_df <- function(df, norm_fact) {
  # The normalization factors are named by the columns they apply to
  norm_cols <- names(norm_fact)
  # Multiply each matching column by its factor
  for (col_name in norm_cols) {
    if (col_name %in% names(df)) {
      df[[col_name]] <- df[[col_name]] * norm_fact[col_name]
    }
  }
  return(df)
}
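A quick base-R check of the per-column multiplication that normalize_df() performs, on invented values:

```r
# Toy data frame and named factors (invented values)
toy <- data.frame(A = c(10, 20), B = c(5, 40))
fact <- c(A = 2, B = 0.5)
# Multiply each column by its named factor, as normalize_df() does
for (nm in names(fact)) toy[[nm]] <- toy[[nm]] * fact[nm]
toy
```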
Putting it all together, we can now normalize our dataset. First, we calculate the normalization factors, and then we apply these factors to our dataset.
norm_fact <- calculate_norm_fact(dataset_sorted, ordered_meta_data)
## Number of rows after filtering: 132
dataset_norm <- normalize_df(dataset_sorted, norm_fact)
write.csv(dataset_norm, file.path(final_out_dir, "normalized_dataset.csv"), row.names = FALSE)
This completes the normalization process, ensuring that all samples are brought to a common scale for more accurate comparison of results across different samples or conditions. A total of 132 rows were used, as only those had measured intensities in all samples.
# Melt the datasets for ggplot2
dataset_sorted_melt <- melt(dataset_sorted, id.vars = c("Protein.Group", "Protein.Ids", "Protein.Names", "Genes", "First.Protein.Description", "ID"))
dataset_norm_melt <- melt(dataset_norm, id.vars = c("Protein.Group", "Protein.Ids", "Protein.Names", "Genes", "First.Protein.Description", "ID"))
# Add a column to distinguish between normalized and unnormalized
dataset_sorted_melt$Type <- "Unnormalized"
dataset_norm_melt$Type <- "Normalized"
# Combine the datasets
combined_data <- rbind(dataset_sorted_melt, dataset_norm_melt)
# Plot the boxplots
p <- ggplot(combined_data, aes(x = variable, y = log10(value), fill = Type)) +
geom_boxplot(outlier.shape = NA) + # Remove outliers for clarity
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Boxplot of Intensities: Normalized vs Unnormalized", x = "Sample", y = "log10 Intensity")
ggsave(file.path(final_out_dir, "Intensities_pre_and_post_normalization.png"), plot = p)
## Saving 7 x 5 in image
## Warning: Removed 336184 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
print(p)
## Warning: Removed 336184 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Melt dataset_norm into a long format
dataset_long <- data.table::melt(dataset_norm,
id.vars = descr_columns,
variable.name = "Sample.ID",
value.name = "maxLfQ_intensities",
variable.factor = FALSE)
# Merge dataset_long with meta_data
dataset_merged <- merge(dataset_long, meta_data, by.x = "Sample.ID", by.y = "Column", all.x = TRUE)
write_tsv(dataset_merged, file.path(final_out_dir, "report.pg_matrix_long.tsv"))
In this step, we filter the rows of the dataset to retain only those entries that are present in at least 3 out of 4 biological replicates for each combination of method and buffer type. This ensures that our analysis focuses on consistently detected proteins across multiple biological replicates, enhancing the reliability of the results.
# Filter rows where technical replicate is 1 and maxLfQ_intensities is not NA
dataset_filtered <- dataset_merged[Repl_Techn == 1 & !is.na(maxLfQ_intensities)]
# Count unique biological replicates for each combination of method, buffer type, and ID
dataset_filtered_repl <- dataset_filtered[, .(Unique_Repl_Biol_Count = uniqueN(Repl_Biol)), by = .(Method, Buffer_name, Buffer_type, ID)]
# Filter rows where the count of unique biological replicates is at least 3
dataset_filtered_in_3_of_4_repl <- dataset_filtered_repl[Unique_Repl_Biol_Count >= 3]
dataset_filtered_in_3_of_4 <- merge(
dataset_filtered,
dataset_filtered_in_3_of_4_repl[, .(Method, Buffer_name, ID)],
by = c("Method", "Buffer_name", "ID")
)
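The replicate-count rule can be illustrated in base R (observations invented): protein 1 is seen in three biological replicates and passes the threshold; protein 2 is seen in only two and is dropped.

```r
# Observations of proteins across biological replicates (invented)
obs <- data.frame(ID = c(1, 1, 1, 2, 2),
                  Repl_Biol = c(1, 2, 3, 1, 2))
# Unique replicate count per protein, then apply the >= 3 threshold
counts <- tapply(obs$Repl_Biol, obs$ID, function(x) length(unique(x)))
names(counts)[counts >= 3]
```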
First, we calculate summary statistics for each protein within each method and buffer combination: mean LFQ, standard deviation, and coefficient of variation. Additionally, we extract the intensities for each biological replicate.
# Calculate statistics for each sample type
stats_df <- dataset_filtered_in_3_of_4 %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name) %>%
summarize(
#MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
.groups = 'drop'
)
# Prepend "M" to the existing M_value in meta_data
meta_data <- meta_data %>% mutate(M = paste0("M", M_value))
# Calculate the intensities for each biological replicate
replicate_intensities <- dataset_filtered %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, M_value, Buffer_name, Repl_Biol) %>%
summarize(Intensity = mean(maxLfQ_intensities, na.rm = TRUE), .groups = 'drop') %>%
mutate(M = paste0("M", M_value)) %>%
mutate(B = paste0("B", Repl_Biol)) %>%
unite("M_B_Repl", M, Buffer_name, B, sep = "-") %>%
select(-Repl_Biol, -M_value) %>%
pivot_wider(names_from = M_B_Repl, values_from = Intensity)
# Create a mapping for Method to the corresponding M value
method_mapping <- unique(meta_data[, .(Method, M)])
# Add a new column M-B to stats_df using the mapping
stats_df_long <- stats_df %>%
group_by(ID) %>%
summarize(Occurrence = paste(unique(Method), collapse = ";")) %>%
right_join(stats_df, by = "ID") %>%
left_join(method_mapping, by = "Method") %>%
mutate(M_B = paste0(M, "-", Buffer_name)) %>%
select(-Method, -Buffer_name) %>%
pivot_longer(cols = c(MeanLFQ, SDLFQ, CVLFQ), names_to = "Metric", values_to = "Value") %>%
unite("M_B_Metric", M_B, Metric, sep = "_") %>%
select(-M) %>%
pivot_wider(names_from = M_B_Metric, values_from = Value) %>%
relocate(Occurrence, .after = First.Protein.Description) # Moves Occurrence to the 5th position
# Merge with replicate intensities and arrange columns
final_df <- stats_df_long %>%
right_join(replicate_intensities, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description")) %>%
arrange(ID)
Next, we generate the column order dynamically so that, for each Method x Buffer combination, the replicate columns come first, followed by the summary metrics.
# Filter to only include Repl_Techn == 1
ordered_meta_copy <- ordered_meta_data %>%
filter(Repl_Techn == 1) %>%
arrange(Method, Buffer_name, Repl_Biol)
# Function to generate columns for each Method x Buffer combination
generate_columns <- function(method, buffer, m_value) {
  replicate_cols <- paste0("M", m_value, "-", buffer, "-B", 1:4)
  metric_cols <- paste0("M", m_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
  c(replicate_cols, metric_cols)
}
method_buffer_combinations <- ordered_meta_copy %>%
distinct(Method, Buffer_name, M_value)
# Generate the column order dynamically
all_cols <- unlist(lapply(1:nrow(method_buffer_combinations), function(i) {
  generate_columns(method_buffer_combinations$Method[i],
                   method_buffer_combinations$Buffer_name[i],
                   method_buffer_combinations$M_value[i])
}))
# Prepend the descriptor columns and keep only columns present in final_df
all_cols <- c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence",
              all_cols[all_cols %in% names(final_df)])
# Select columns in the specified order
final_df <- final_df %>%
select(all_of(all_cols))
Finally, we save the sorted dataframe to a CSV file.
# Save the final dataframe to CSV
write.csv(final_df, file.path(final_out_dir, "statistics_df.csv"), row.names = FALSE)
library("UniprotR")
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
# Define the path to the FASTA file
fasta_file <- "data/UP000002494.fasta"
# Read the FASTA file
fasta_sequences <- readAAStringSet(fasta_file)
# Extract Protein IDs and sequences
protein_ids <- names(fasta_sequences)
sequences <- as.character(fasta_sequences)
# Create a data frame mapping Protein IDs to sequences
protein_sequence_map <- data.frame(ProteinID = protein_ids, Sequence = sequences, stringsAsFactors = FALSE)
# Split the ProteinID by '|' and extract the second element
protein_sequence_map$ProteinID <- sapply(strsplit(protein_sequence_map$ProteinID, "\\|"), `[`, 2)
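UniProt FASTA headers have the form `db|accession|entry_name description`, so the accession is the second `|`-separated field. An illustrative header (P02452 is a real collagen accession, used here only as an example):

```r
# Example UniProt FASTA header; the accession is field 2
hdr <- "sp|P02452|CO1A1_HUMAN Collagen alpha-1(I) chain"
sapply(strsplit(hdr, "\\|"), `[`, 2)
```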
result_df <- final_df[c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence")] %>%
mutate(Protein.ID = Protein.Group) %>%
separate_rows(Protein.ID, sep = ";") %>%
distinct() %>%
left_join(protein_sequence_map, by = c("Protein.ID" = "ProteinID"))
library("Peptides")
# Function to calculate properties for a sequence
calculate_properties <- function(seq) {
  # Missing sequences yield NA properties
  if (is.na(seq)) {
    return(data.frame(pI = NA, mw = NA, gravy = NA))
  }
  pI_value <- pI(seq = seq, pKscale = "Bjellqvist")
  mw_value <- mw(seq = seq, monoisotopic = FALSE)
  gravy_value <- hydrophobicity(seq = seq, scale = "KyteDoolittle")
  return(data.frame(pI = pI_value, mw = mw_value, gravy = gravy_value))
}
# Apply the function to each sequence in the data frame
properties_df <- do.call(rbind, lapply(result_df$Sequence, calculate_properties))
## Warning: Sequence 1 has unrecognized amino acid types. Output value might be
## wrong calculated
# Combine the properties with the original data frame
result_with_properties_df <- cbind(result_df, properties_df)
attributes_df <- result_with_properties_df %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Occurrence) %>%
  summarize(
    mean_pI = mean(pI, na.rm = TRUE),
    mean_mw = mean(mw, na.rm = TRUE),
    mean_gravy = mean(gravy, na.rm = TRUE),
    .groups = "drop"
  )
write.csv(attributes_df, file.path(final_out_dir, "attributes_df.csv"), row.names = FALSE)
final_attributes_df <- attributes_df %>%
inner_join(final_df, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence"))
write.csv(final_attributes_df, file.path(final_out_dir, "attributes_statistics_df.csv"), row.names = FALSE)
write_xlsx(final_attributes_df, file.path(final_out_dir, "final_attributes_df.xlsx"))
common_columns <- c("ID", "Protein.Group", "Genes", "First.Protein.Description")
merged_attributes_df <- merge(dataset_filtered_in_3_of_4, attributes_df, by = common_columns)
merged_attributes_df_repl <- merge(dataset_filtered_in_3_of_4_repl, attributes_df, by = "ID")
# Calculate unique proteins for each method for the two conditions
plot_dataset <- rbind(
dataset_filtered_in_3_of_4_repl[!Method %in% c("1step", "1step+C"), .(Protein_Count = .N, Source = "3 of 4 biol. Replicates"), by = .(Method, Buffer_name, Buffer_type)],
dataset_filtered_in_3_of_4_repl[, .(Protein_Count = uniqueN(ID)), by = Method][
, `:=`(Buffer_name = "All Buffers", Buffer_type = "Total", Source = "3 of 4 biol. Replicates")])
# Create and save the barplot
p <- ggplot(plot_dataset, aes(x = Method, y = Protein_Count, fill = Buffer_type)) +
geom_bar(aes(group = interaction(Buffer_name, Method)), stat = "identity", width = 0.8, position = position_dodge2(width = 0.9, preserve = "single"), alpha = 0.9) +
geom_text(aes(label = Protein_Count, group = interaction(Buffer_name, Method)), position = position_dodge2(width = 0.9, preserve = "single"), vjust = -0.5, hjust = 0.5, size = 3) +
#facet_wrap(~ Method, scales = "free_x", ncol = 2) +
theme_minimal() + # Using a minimal theme for a cleaner look
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Number of Quantified Proteins per Buffer, by Method",
y = "No. of reproducibly\nquantified proteins", x = "Method", fill = "Buffer Type") +
scale_fill_manual(values = fraction_colors) +
ylim(0, 6000)
ggsave(file.path(final_out_dir, "Proteins_per_Buffer_by_Method_with_Summary.png"), plot = p)
## Saving 7 x 5 in image
print(p)
plot_df <- stats_df %>%
mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
mutate(
Buffer_type = ifelse(grepl("AI", Buffer_name), "AI",
ifelse(grepl("AS", Buffer_name), "AS", "single"))) %>%
arrange(Method, Buffer_name) %>%
mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))
# Determine positions for vertical lines to separate the methods
method_positions <- plot_df %>%
group_by(Method) %>%
summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
filter(position > 0)
# Create the violin plot with overlaid boxplot and vertical lines
p2 <- ggplot(plot_df, aes(x = Method_Buffer, y = CVLFQ, fill = Buffer_type)) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) + # Add boxplot
geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black") +
theme_minimal() +
labs(title = "Distribution of CVLFQ by Buffer Name and Method",
x = "Method - Buffer Name",
y = "CVLFQ", fill = "Buffer Type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = fraction_colors)
ggsave(file.path(final_out_dir, "Distribution_of_CVLFQ_by_Buffer_Name_and_Method.png"), plot = p2)
## Saving 7 x 5 in image
print(p2)
combined_plot <- p / p2 + plot_layout(ncol = 1) + plot_annotation(tag_levels = 'A') +
plot_layout(guides = "collect") &
theme(legend.position = "bottom")
# Save the combined figure
ggsave(file.path(final_out_dir, "Method_Comparison.png"), plot = combined_plot)
## Saving 7 x 5 in image
print(combined_plot)
# Create binary presence/absence matrix per method
binary_presence_method <- dataset_filtered_in_3_of_4 %>%
distinct(ID, Method) %>%
pivot_wider(names_from = Method, values_from = Method,
values_fill = list(Method = 0),
values_fn = list(Method = function(x) 1)) %>%
tibble::column_to_rownames(var = "ID")
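The presence/absence pivot can be sketched in base R on a toy set of protein-method observations (invented):

```r
# Toy observations: protein 1 found by methods A and B, protein 2 by A only
obs <- data.frame(ID = c(1, 1, 2), Method = c("A", "B", "A"))
# table() counts occurrences; > 0 turns counts into presence/absence
tab <- as.data.frame.matrix((table(obs$ID, obs$Method) > 0) * 1L)
tab
```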
# Add an 'all' column to highlight full overlaps
binary_presence_method$all <- 1
# Optional: Reorder columns (if specific order is desired)
#binary_presence_method <- binary_presence_method[, c(setdiff(names(binary_presence_method), "all"), "all")]
binary_presence_method <- binary_presence_method[, c(5, 4, 3, 2, 1, 6)]  # hard-coded order: methods reversed, 'all' last
# Identify method columns (excluding 'all')
method_columns <- setdiff(names(binary_presence_method), "all")
# Query: full intersection of all sets (including "all")
full_intersection_query <- ComplexUpset::upset_query(
intersect = names(binary_presence_method),
color = 'orange',
fill = 'orange',
only_components = c('intersections_matrix')
)
# Queries: exact 2-way intersections ("all" + one method)
pairwise_queries <- lapply(method_columns, function(method) {
ComplexUpset::upset_query(
intersect = c("all", method),
color = "steelblue",
fill = "steelblue",
only_components = c('intersections_matrix', 'No. of\nintersecting\nproteins')
)
})
# Query: highlight just the "all" set bar (not intersections)
all_set_query <- ComplexUpset::upset_query(
set = "all",
fill = "orange"
)
# Combine all queries
query_list <- c(list(full_intersection_query), pairwise_queries, list(all_set_query))
# Generate the plot
upset_plot_method <- ComplexUpset::upset(
data = binary_presence_method,
intersect = names(binary_presence_method),
name = "",
wrap = TRUE,
themes = upset_modify_themes(list(
'intersections_matrix' = theme(axis.text.y = element_text(size = 13))
)),
base_annotations = list(
'No. of\nintersecting\nproteins' = intersection_size(
bar_number_threshold = 1,
text=list(
vjust=-0.1,
hjust=0.1,
angle=45
)) +
expand_limits(y = 2000) +
theme(
panel.grid.minor = element_blank(),
axis.text.y = element_text(size = 13),
axis.title.y = element_text(size = 13)
)
),
set_sizes = (
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count', size = 4)
+ annotate(geom = 'text', label = '@', x = 'Count', y = 850, color = 'white', size = 10)
+ expand_limits(y = 7000)
+ theme(axis.title.x = element_text(size = 13), axis.text.x = element_text(size = 10))
),
sort_sets = FALSE,
queries = query_list,
sort_intersections_by = 'ratio'
) + theme(text = element_text(size = 13), axis.title.x = element_text(size = 13))
# Save the plot
#ggsave(file.path(final_out_dir, "Method_Intersection_with_Highlighted_Exact2.png"), plot = upset_plot_method, width = 10, height = 6, bg = 'white')
# Show the plot
print(upset_plot_method)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## ℹ The deprecated feature was likely used in the ComplexUpset package.
## Please report the issue at
## <https://github.com/krassowski/complex-upset/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Load necessary libraries
library(ggplot2)
library(patchwork)
library(dplyr)
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:patchwork':
##
## align_plots
# Create Method_Buffer column for the first plot dataset
plot_dataset <- rbind(
dataset_filtered_in_3_of_4_repl[, .(Protein_Count = .N, Source = "3 of 4 biol. Replicates"), by = .(Method, Buffer_name, Buffer_type)],
dataset_filtered_in_3_of_4_repl[, .(Protein_Count = uniqueN(ID)), by = Method][
, `:=`(Buffer_name = "Total", Buffer_type = "Total", Source = "3 of 4 biol. Replicates")]
) %>%
mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
arrange(Method, Buffer_name) %>%
mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))
# Combine all unique Buffer_name levels
combined_levels <- unique(plot_dataset$Buffer_name)
# Create the mapping for x-axis labels using Buffer_name
buffer_name_labels <- plot_dataset %>%
select(Method_Buffer, Buffer_name) %>%
distinct() %>%
arrange(Method_Buffer)
buffer_name_labels$Buffer_name <- as.character(buffer_name_labels$Buffer_name)
named_labels <- setNames(buffer_name_labels$Buffer_name, buffer_name_labels$Method_Buffer)
named_labels[named_labels == "single"] <- "Single"
plot_dataset$Buffer_type <- factor(plot_dataset$Buffer_type, levels = c("single", "AS", "AI", "Total"))
plot_dataset_filtered <- plot_dataset %>% filter(Buffer_type != "single")
plot_dataset_filtered$Method_Buffer <- factor(plot_dataset_filtered$Method_Buffer,
levels = setdiff(levels(plot_dataset_filtered$Method_Buffer),
c("1step_single", "1step+C_single")))
# Determine positions for vertical lines to separate the methods
method_positions <- plot_dataset_filtered %>%
group_by(Method) %>%
summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
filter(position > 0)
# Calculate midpoints for each Method to place the Method label centered above the group
method_labels <- plot_dataset_filtered %>%
group_by(Method) %>%
summarize(
midpoint = mean(as.numeric(Method_Buffer))) %>%
ungroup()
# Create the first plot (barplot)
p1 <- ggplot(plot_dataset_filtered, aes(x = Method_Buffer, y = Protein_Count, fill = Buffer_type)) +
geom_bar(aes(group = interaction(Buffer_name, Method)), stat = "identity", width = 0.8, position = position_dodge2(width = 0.9, preserve = "single"), color = "black") +
geom_text(aes(label = Protein_Count, group = interaction(Buffer_name, Method)), position = position_dodge2(width = 0.9, preserve = "single"), vjust = -0.5, hjust = 0.5, size = 4) +
geom_text(data = method_labels, aes(x = midpoint, y = 6000, label = Method),
size = 4, vjust = -0.5, color = "black", inherit.aes = FALSE) +
theme_minimal() +
theme(
panel.grid = element_blank(), # Remove grid
panel.background = element_blank(), # Remove background
axis.title.y = element_text(size = 16), # Change y-axis label size
axis.text.y = element_text(size = 12), # Change y-axis tick label size
axis.text.x = element_text(size = 12, angle = 0), # Change x-axis tick label size
strip.text.x = element_text(size = 14),
panel.spacing = unit(0.1, "lines"),
panel.border = element_rect(color = "white", fill = NA, linewidth = 1),
legend.position = "none"
) +
labs(
title = "",
y = "No. of reproducibly\nquantified proteins",
x = "",
fill = "Buffer Type"
) +
scale_fill_manual(
values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors)),
labels = c("single" = "Single", "AS" = "AS", "AI" = "AI", "Total" = "Total")
) +
scale_x_discrete(labels = named_labels) +
ylim(0, 6500) +
geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black", linewidth = 0.8) +
geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = 0.5)
# Create the second plot (violin plot with boxplot)
plot_df <- stats_df %>%
mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
mutate(
Buffer_type = factor(ifelse(grepl("AI", Buffer_name), "AI",
ifelse(grepl("AS", Buffer_name), "AS", "single")),
levels = c("single", "AS", "AI", "Total"))) %>%
arrange(Method, Buffer_name) %>%
mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))
# Create the mapping for x-axis labels using Buffer_name
buffer_name_labels <- plot_df %>%
select(Method_Buffer, Buffer_name) %>%
distinct() %>%
arrange(Method_Buffer)
buffer_name_labels$Buffer_name <- as.character(buffer_name_labels$Buffer_name)
named_labels <- setNames(buffer_name_labels$Buffer_name, buffer_name_labels$Method_Buffer)
named_labels[named_labels == "single"] <- "Single"
# Determine positions for vertical lines to separate the methods
method_positions <- plot_df %>%
group_by(Method) %>%
summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
filter(position > 0)
# Calculate midpoints for each Method to place the Method label centered above the group
method_labels <- plot_df %>%
group_by(Method) %>%
summarize(
midpoint = mean(as.numeric(Method_Buffer))) %>%
ungroup()
p2 <- ggplot(plot_df, aes(x = Method_Buffer, y = CVLFQ, fill = Buffer_type)) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
theme_minimal() +
geom_text(data = method_labels, aes(x = midpoint, y = 220, label = Method),
size = 4, vjust = -0.5, color = "black", inherit.aes = FALSE) +
labs(title = "",
x = "",
y = "Coefficient of variation\nin % (maxLfQ)", fill = "Buffer Type") +
theme(panel.grid = element_blank(), panel.background = element_blank(),
axis.title.y = element_text(size = 16), # Change y-axis label size
axis.text.y = element_text(size = 12), # Change y-axis tick label size
axis.text.x = element_text(size = 12, angle = 0), # Change x-axis tick label size
strip.text.x = element_text(size = 14),
panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "white", fill = NA, linewidth = 1), legend.position = "none") +
scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.8), names(fraction_colors))) +
scale_x_discrete(labels = named_labels) +
ylim(0, 240) +
geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black") +
geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = 0.5)
# Create a data frame just for the legend
legend_data <- data.frame(
Buffer_type = factor(c("single", "AS", "AI", "Total"), levels = c("single", "AS", "AI", "Total")),
x = 1:4, # Arbitrary values for plotting
y = 1 # Arbitrary values for plotting
)
# Generate the custom legend plot
legend_plot <- ggplot(legend_data, aes(x = x, y = y, fill = Buffer_type)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.8), names(fraction_colors)),
labels = c("Single", "AS", "AI", "Total")) +
theme_void() +
theme(legend.position = "bottom") +
labs(fill = "Buffer Type ")
legend_grob <- legend_plot + theme(legend.position = "bottom") + guides(fill = guide_legend(nrow = 1))
combined_plot <-((p1 / p2) +
plot_layout(heights = c(1, 1)) +
plot_annotation(tag_levels = 'A')) /
plot_grid(legend_grob, ncol = 1) +
plot_layout(heights = c(1, 1, 0.1))
# Save the combined plot
ggsave(file.path(final_out_dir, "Combined_Proteins_and_CVLFQ_Distribution.png"), plot = combined_plot,
width = 1148, height = 729, units = "px", dpi = 100)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).
# Print the combined plot
print(combined_plot)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).
legend_grob <- legend_plot + theme(legend.position = "bottom") + guides(fill = guide_legend(nrow = 1))
combined_plot <-((p1 / p2) +
plot_layout(heights = c(1, 1)) +
plot_annotation(tag_levels = 'A')) /
plot_grid(legend_grob, ncol = 1) +
plot_layout(heights = c(1, 1, 0.01, 2)) +
upset_plot_method
# Save final output
ggsave(file.path(final_out_dir, "Figure2.png"), plot = combined_plot, width = 1148, height = 1000, units = "px", dpi = 100)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Print to screen
print(combined_plot)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Calculate the median CVLFQ for each Method_Buffer group
median_values <- plot_df %>%
group_by(Method_Buffer) %>%
summarize(median_CVLFQ = median(CVLFQ, na.rm = TRUE))
# Display the median values
print(median_values)
## # A tibble: 10 × 2
## Method_Buffer median_CVLFQ
## <fct> <dbl>
## 1 1step_single 52.6
## 2 1step+C_single 29.6
## 3 2step_AS 37.0
## 4 2step_AI 33.0
## 5 2step+M_AS 31.4
## 6 2step+M_AI 18.8
## 7 4step_AS1 39.9
## 8 4step_AI1 35.8
## 9 4step_AI2 23.8
## 10 4step_AS2 39.2
In this analysis, we compare the performance of two methods, 2step and 2step+M, across two buffer types: AS and AI. The objective is to evaluate differences in the identified protein-group IDs and to assess the intersections and unique contributions of each method-buffer combination. The data have been processed to extract unique IDs for the different buffer-method combinations, and several visualizations have been prepared to provide insight.
We first extract the unique IDs for each buffer and method combination from the filtered dataset. The goal is to analyze the sets of IDs and their overlaps using various visualization techniques.
# Extract unique IDs for each buffer and method combination
buffer_groups <- dataset_filtered_in_3_of_4_repl[, .N, by = .(ID, Buffer_name, Method)]
sets <- list(
`2step+M AI` = buffer_groups[Buffer_name == "AI" & Method == "2step+M", ID],
`2step+M AS` = buffer_groups[Buffer_name == "AS" & Method == "2step+M", ID],
`2step AI` = buffer_groups[Buffer_name == "AI" & Method == "2step", ID],
`2step AS` = buffer_groups[Buffer_name == "AS" & Method == "2step", ID]
)
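Before plotting, it can help to sanity-check the set sizes and pairwise overlaps. A minimal base-R sketch, using small toy ID vectors in place of the real `sets` list (same named-list structure, hypothetical contents):

```r
# Toy stand-in for the `sets` list built above: a named list of ID vectors
toy_sets <- list(
  `2step+M AI` = c(1, 2, 3, 4),
  `2step+M AS` = c(3, 4, 5),
  `2step AI`   = c(2, 3, 6),
  `2step AS`   = c(3, 7)
)
# Per-set sizes
sizes <- vapply(toy_sets, length, integer(1))
# Pairwise overlap counts between every pair of sets
overlaps <- outer(names(toy_sets), names(toy_sets),
                  Vectorize(function(a, b) length(intersect(toy_sets[[a]], toy_sets[[b]]))))
dimnames(overlaps) <- list(names(toy_sets), names(toy_sets))
print(sizes)
print(overlaps)
```

Running the same two checks on the real `sets` list gives the numbers that the UpSet plot and Venn diagrams below should reproduce.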
The UpSet plot visualizes the intersections of IDs among the different buffer-method combinations, helping to identify unique and shared IDs across the four sets:
The plot highlights the overlaps and exclusive IDs, providing a clear picture of the distribution of IDs across the different methods and buffers.
library("ComplexUpset")
library("ggplot2")
# Create a data frame to represent the presence/absence of each ID in the sets
all_ids <- unique(unlist(sets))
upset_data <- as.data.frame(sapply(sets, function(x) as.integer(all_ids %in% x)))
upset_data$ID <- all_ids
# Convert to long format for ComplexUpset
upset_data_long <- reshape2::melt(upset_data, id.vars = "ID", variable.name = "Set", value.name = "Presence")
upset_data_long <- upset_data_long[upset_data_long$Presence == 1, ]
# Define the colors for the sets
set_colors <- c("royalblue", "palegreen3", "royalblue", "palegreen3")
names(set_colors) <- names(sets)
# Create queries based on set colors
queries <- lapply(names(set_colors), function(set) {
upset_query(set = set, fill = set_colors[set])
})
# Create the UpSet plot using ComplexUpset
upset_plot <- ComplexUpset::upset(
upset_data,
names(sets),
name = "",
sort_sets = FALSE,
queries = queries,
set_sizes = (upset_set_size() + geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count') + expand_limits(y = 5000))
) + ggtitle("Comparison of AS and AI Buffers Across 2step and 2step+ Methods")
# Save the plot
ggsave(
filename = file.path(final_out_dir, "Comparison_of_AS_and_AI_Buffers_Across_2step_and_2step_plus_Methods.png"),
plot = upset_plot,
width = 10,
height = 6,
bg = 'white'
)
# Print the plot to the console
print(upset_plot)
To further explore the intersections, Venn diagrams are generated. We create Venn diagrams for combined buffer-method sets as well as separately for 2step and 2step+ methods:
library(ggvenn)
## Loading required package: grid
##
## Attaching package: 'grid'
## The following object is masked from 'package:Biostrings':
##
## pattern
# Function to create and save Venn diagram
create_and_save_venn <- function(sets_list, title, sub_dir, filename, colors) {
mapped_colors <- sapply(names(sets_list), function(x) {
set_name <- strsplit(x, " ")[[1]][2] # Split by space and take the second part
colors[set_name] # Use the second part to get the color
}, USE.NAMES = TRUE)
names(mapped_colors) <- sapply(names(mapped_colors), function(x) {strsplit(x, "\\.")[[1]][1]})
venn_plot <- ggvenn(
data = sets_list,
fill_color = unname(mapped_colors),
fill_alpha = 1,
stroke_size = 0.5,
set_name_size = 5,
text_size = 4
) + ggtitle(title) + coord_fixed(expand = FALSE, xlim = c(-2.2, 2.2), ylim = c(-2, 1.7))
# Save the Venn diagram
ggsave(file.path(sub_dir, filename), plot = venn_plot, width = 7.5, height = 6, bg = 'white')
return(venn_plot)
}
venn_combined <- create_and_save_venn(
sets_list = list(`2step AS` = sets$`2step AS`, `2step+ AS` = sets$`2step+M AS`, `2step AI` = sets$`2step AI`, `2step+ AI` = sets$`2step+M AI`),
title = "Intersection of AS and AI Buffers Across 2step and 2step+ Methods",
sub_dir = final_out_dir,
filename = "VennDiagram_Combined.png",
colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(
sets_list = list(`2step AS` = sets$`2step AS`, `2step AI` = sets$`2step AI`),
title = "Intersection of AS and AI in 2step Method",
sub_dir = final_out_dir,
filename = "VennDiagram_2step.png",
colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(
sets_list = list(`2step+ AS` = sets$`2step+M AS`, `2step+ AI` = sets$`2step+M AI`),
title = "Intersection of AS and AI in 2step+ Method",
sub_dir = final_out_dir,
filename = "VennDiagram_2step_plus.png",
colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
library(VennDiagram)
## Loading required package: futile.logger
library(ggplotify)
library(grid)
# Function to create and save Venn diagram
create_and_save_venn <- function(ai_ids, as_ids, method_label, sub_dir, filename) {
# Suppress warnings temporarily
oldw <- getOption("warn")
options(warn = -1)
venn.plot <- venn.diagram(
x = list(
AS = unique(na.omit(as_ids)),
AI = unique(na.omit(ai_ids))
),
category.names = c("AS", "AI"),
filename = NULL,
fill = c("palegreen3", "royalblue"),
alpha = 0.5,
cex = 2,
cat.cex = 2,
cat.pos = c(20, -30),
cat.dist = 0.05,
cat.col = c("palegreen3", "royalblue"),
margin = 0.1,
inverted = TRUE
)
# Save the Venn diagram
png(filename = file.path(sub_dir, filename), width = 800, height = 600)
grid.newpage()
grid.draw(venn.plot)
grid.text(paste("Intersection of AS and AI in", method_label, "Method"), x = 0.5, y = 0.95, gp = gpar(fontsize = 16, fontface = "bold"))
dev.off()
# Restore warning settings
options(warn = oldw)
return(venn.plot)
}
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(sets$`2step AI`, sets$`2step AS`, "2step", final_out_dir, "VennDiagram_2step.png")
venn1.ggplot <- as.ggplot(~grid.draw(venn1)) #+ ggtitle("Intersection of AS and AI in 2step Method")
# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(sets$`2step+M AI`, sets$`2step+M AS`, "2step+M", final_out_dir, "VennDiagram_2step_plus.png")
venn2.ggplot <- as.ggplot(~grid.draw(venn2))# + ggtitle("Intersection of AS and AI in 2step+ Method")
A split violin plot is created to visualize the distribution of log-transformed mean LFQ values by buffer type, separated by method (2step vs. 2step+M). This plot allows a direct comparison of the intensity distributions between the two methods for each buffer type.
GeomSplitViolin <- ggproto("GeomSplitViolin", GeomViolin,
draw_group = function(self, data, ..., draw_quantiles = NULL) {
data <- transform(data, xminv = x - violinwidth * (x - xmin), xmaxv = x + violinwidth * (xmax - x))
grp <- data[1, "group"]
newdata <- plyr::arrange(transform(data, x = if (grp %% 2 == 1) xminv else xmaxv), if (grp %% 2 == 1) y else -y)
newdata <- rbind(newdata[1, ], newdata, newdata[nrow(newdata), ], newdata[1, ])
newdata[c(1, nrow(newdata) - 1, nrow(newdata)), "x"] <- round(newdata[1, "x"])
if (length(draw_quantiles) > 0 & !scales::zero_range(range(data$y))) {
stopifnot(all(draw_quantiles >= 0), all(draw_quantiles <=
1))
quantiles <- ggplot2:::create_quantile_segment_frame(data, draw_quantiles)
aesthetics <- data[rep(1, nrow(quantiles)), setdiff(names(data), c("x", "y")), drop = FALSE]
aesthetics$alpha <- rep(1, nrow(quantiles))
both <- cbind(quantiles, aesthetics)
quantile_grob <- GeomPath$draw_panel(both, ...)
ggplot2:::ggname("geom_split_violin", grid::grobTree(GeomPolygon$draw_panel(newdata, ...), quantile_grob))
}
else {
ggplot2:::ggname("geom_split_violin", GeomPolygon$draw_panel(newdata, ...))
}
})
geom_split_violin <- function(mapping = NULL, data = NULL, stat = "ydensity", position = "identity", ...,
draw_quantiles = NULL, trim = TRUE, scale = "area", na.rm = FALSE,
show.legend = NA, inherit.aes = TRUE) {
layer(data = data, mapping = mapping, stat = stat, geom = GeomSplitViolin,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(trim = trim, scale = scale, draw_quantiles = draw_quantiles, na.rm = na.rm, ...))
}
plot_df <- stats_df[stats_df$Method %in% c("2step", "2step+M"),]
# Create the split violin plot with log-transformed MeanLFQ
violin <- ggplot(plot_df, aes(x = Method, y = log10(MeanLFQ), fill = Buffer_name)) +
geom_split_violin() +
labs(title = "", x = "Method", y = "log10 maxLfQ intensities") +
theme_minimal() +
scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors)))
# scale_fill_manual(values = c("royalblue", "palegreen3"))
ggsave(file.path(final_out_dir, "Violin_BufferTypes.png"), plot = violin, width = 10, height = 6, bg = 'white')
Finally, the individual plots are combined into a single figure for a comprehensive visual summary of the analysis. The combined plot includes the Venn diagrams, Upset plot, and violin plot.
# Combining the plots
combined_plot <- (venn1.ggplot | venn2.ggplot | violin) / (upset_plot) + plot_annotation(tag_levels = 'A')
# Save the combined figure
ggsave(file.path(final_out_dir, "2step_vs_2stepplus.png"), plot = combined_plot, width = 12, height = 6, dpi = 100)
print(combined_plot)
library(VennDiagram)
library(ggplotify)
library(grid)
# Function to create and save Venn diagram
create_and_save_venn <- function(ai_ids, as_ids, method_label) {
# Suppress warnings temporarily
oldw <- getOption("warn")
options(warn = -1)
venn.plot <- venn.diagram(
x = list(
AS = unique(as_ids),
AI = unique(ai_ids)
),
category.names = c("AS", "AI"),
filename = NULL,
fill = c("palegreen3", "royalblue"),
alpha = 0.7,
cex = 1.2,
fontfamily = "sans", # Set font to match other plots
cat.cex = 1.2,
cat.pos = c(20, -30),
cat.dist = 0.05,
cat.col = c("palegreen3", "royalblue"),
cat.fontfamily = "sans",
margin = 0.01,
inverted = TRUE
)
# Restore warning settings
options(warn = oldw)
return(venn.plot)
}
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(sets$`2step AI`, sets$`2step AS`, "2step")
venn1.ggplot <- as.ggplot(~grid.draw(venn1)) #+ ggtitle("Intersection of AS and AI in 2step Method")
# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(sets$`2step+M AI`, sets$`2step+M AS`, "2step+")
venn2.ggplot <- as.ggplot(~grid.draw(venn2))# + ggtitle("Intersection of AS and AI in 2step+ Method")
plot_df <- stats_df[stats_df$Method %in% c("2step", "2step+M"),]
library(scales) # provides label_number(), used for the axis labels below
breaks <- 10^(-10:10)
minor_breaks <- rep(1:9, 21)*(10^rep(-10:10, each=9))
# Create the split violin plot with log-transformed MeanLFQ
violin <- ggplot(plot_df, aes(x = Method, y = MeanLFQ, fill = Buffer_name)) +
geom_split_violin() +
labs(title = "", x = "", y = "log10 maxLfQ intensities", fill = "Buffer Type") +
theme_minimal() +
scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors))) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 14), # Larger x-axis labels
axis.text.y = element_text(size = 10), # Log10 scaling applied
panel.grid.minor.y = element_line(color="grey", linetype="dotted"),
panel.grid.major.y = element_line(color="grey", linetype="solid")
) + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
annotation_logticks(sides="l")
library("ComplexUpset")
library("ggplot2")
# Create a data frame to represent the presence/absence of each ID in the sets
all_ids <- unique(unlist(sets))
upset_data <- as.data.frame(sapply(sets, function(x) as.integer(all_ids %in% x)))
upset_data$ID <- all_ids
# Convert to long format for ComplexUpset
upset_data_long <- reshape2::melt(upset_data, id.vars = "ID", variable.name = "Set", value.name = "Presence")
upset_data_long <- upset_data_long[upset_data_long$Presence == 1, ]
# Define the colors for the sets
set_colors <- c("royalblue", "palegreen3", "royalblue", "palegreen3")
names(set_colors) <- names(sets)
# Create queries based on set colors
queries <- lapply(names(set_colors), function(set) {
upset_query(set = set, fill = set_colors[set])
})
# Create the UpSet plot using ComplexUpset
upset_plot <- ComplexUpset::upset(
upset_data,
names(sets),
name = "",
sort_sets = FALSE,
base_annotations=list(
'No. of\nintersecting\nproteins'=intersection_size(counts=TRUE) +
theme(axis.text.y = element_text(size = 12), # Increase y-axis annotation size
axis.title.y = element_text(size = 13)) # Increase y-axis title size
),
queries = queries,
set_sizes = (upset_set_size() + geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count', size = 4) + expand_limits(y = 5000) +
theme(
axis.title.x = element_text(size = 13), # Set size X-axis title size
axis.text.x = element_text(size = 12) # X-axis tick labels size
))
) + theme(text = element_text(size = 14),
axis.title.x = element_text(size = 12))
# Save the plot
ggsave(
filename = file.path(final_out_dir, "Comparison_of_AS_and_AI_Buffers_Across_2step_and_2step_plus_Methods.png"),
plot = upset_plot,
width = 10,
height = 6,
bg = 'white'
)
# Print the plot to the console
print(upset_plot)
# Combining the plots
combined_plotx <- (venn1.ggplot | venn2.ggplot | violin) / (as.ggplot(upset_plot)) + plot_annotation(tag_levels = 'A') +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
# Save the combined figure
ggsave(file.path(final_out_dir, "2step_vs_2stepplus.png"), plot = combined_plotx, width = 12, height = 8, dpi = 200)
print(combined_plotx)
combined_df <- stats_df %>% arrange(Protein.Group, Genes, Method, Buffer_name)
filtered_df <- combined_df %>%
filter(Method %in% c("2step", "2step+M")) %>%
mutate(Protein.ID = Protein.Group) %>%
separate_rows(Protein.ID, sep = ";") %>%
left_join(result_with_properties_df, by=c("ID","Protein.Group","Genes","Protein.ID"))
# Identify IDs uniquely found in either "2step" or "2step+" methods or shared
unique_ids <- filtered_df %>%
group_by(ID) %>%
summarize(Methods = n_distinct(Method)) %>%
ungroup()
# Join the unique_ids back to the filtered_df to retain the Method column
unique_ids <- filtered_df %>%
left_join(unique_ids, by = "ID")
# Classify IDs
filtered_df <- unique_ids %>%
mutate(Category = ifelse(Methods == 1,
ifelse(Method == "2step", "Only 2step",
ifelse(Method == "2step+M", "Only 2step+M", NA)),
ifelse(Methods > 1, "Shared", NA)))
# Categorize by buffer type
buffer_categories <- filtered_df %>%
group_by(ID, Method) %>%
summarize(Buffers = n_distinct(Buffer_name)) %>%
ungroup()
## `summarise()` has grouped output by 'ID'. You can override using the `.groups`
## argument.
buffer_categories <- filtered_df %>%
left_join(buffer_categories, by = c("ID", "Method"))
filtered_df <- buffer_categories %>%
mutate(Buffer_Category = ifelse(Buffers == 1,
ifelse(Buffer_name == "AI", "Only AI",
ifelse(Buffer_name == "AS", "Only AS", NA)),
ifelse(Buffers > 1, "Both", NA)))
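The two-stage classification above (first by method, then by buffer within each method) can be checked on a small toy table. A base-R sketch of the same logic, using hypothetical IDs rather than the real dataset:

```r
# Toy long-format table: one row per (ID, Method, Buffer) observation
toy <- data.frame(
  ID     = c(1, 1, 2, 3, 3),
  Method = c("2step", "2step+M", "2step", "2step", "2step"),
  Buffer = c("AI", "AI", "AS", "AI", "AS")
)
# Count distinct methods per ID: more than one method means "Shared"
n_methods <- tapply(toy$Method, toy$ID, function(x) length(unique(x)))
toy$Category <- ifelse(n_methods[as.character(toy$ID)] > 1, "Shared",
                       paste("Only", toy$Method))
# Count distinct buffers per (ID, Method): more than one buffer means "Both"
key <- paste(toy$ID, toy$Method)
n_buffers <- tapply(toy$Buffer, key, function(x) length(unique(x)))
toy$Buffer_Category <- ifelse(n_buffers[key] > 1, "Both",
                              paste("Only", toy$Buffer))
print(toy)
```

Here ID 1 is "Shared" (seen by both methods), ID 2 is "Only 2step", and ID 3 is "Only 2step" but "Both" buffers, mirroring the dplyr pipeline above.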
# Summarize data for plotting
plot_data <- filtered_df %>%
group_by(Category, Buffer_Category) %>%
summarize(Count = n_distinct(Protein.ID))
## `summarise()` has grouped output by 'Category'. You can override using the
## `.groups` argument.
# Filter the original filtered_df based on unique IDs
filtered_df2 <- filtered_df %>%
filter(Category != "Shared")
# data.table's melt() generic redirects tbl_df input to reshape2 with a deprecation
# warning, so call reshape2::melt() explicitly
melted_df <- reshape2::melt(filtered_df2, id.vars = c("ID", "Protein.Group", "Genes", "Method", "Buffer_name"),
                 measure.vars = c("pI", "mw", "gravy"),
                 variable.name = "Property",
                 value.name = "Value")
# Function to plot each property
plot_property <- function(df, property, log_scale = FALSE) {
p <- ggplot(df %>% filter(Property == property), aes(x = Buffer_name, y = Value, fill = Method)) +
geom_split_violin() +
theme_minimal() +
scale_fill_manual(values = c("#E69F00", "#56B4E9"))
if (log_scale) {
p <- p + scale_y_log10() +
labs(title = paste("Log10", property, "by Buffer Type"), x = "Buffer Type", y = property) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 16), # Larger x-axis labels
axis.text.y = element_text(size = 10), # Log10 scaling applied
panel.grid.minor.y = element_line(color="grey", linetype="dotted"),
panel.grid.major.y = element_line(color="grey", linetype="solid")
) + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
annotation_logticks(sides="l")
} else {
p <- p + labs(title = paste(property, "by Buffer Type"), x = "Buffer Type", y = property)
}
return(p)
}
# Plot for pI
plot_pI <- plot_property(melted_df, "pI")
print(plot_pI)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Plot for mw with logarithmic scale
plot_mw <- plot_property(melted_df, "mw", log_scale = TRUE)
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
print(plot_mw)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Plot for gravy
plot_gravy <- plot_property(melted_df, "gravy")
print(plot_gravy)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
library(scales)
plot_df <- merged_attributes_df_repl[merged_attributes_df_repl$Method %in% c("2step", "2step+M")]
# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
geom_split_violin() +
theme_minimal() +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
theme(axis.title.x = element_blank(), legend.position = "none") +
theme(text = element_text(size = 14))
if (log_scale) {
p <- p + scale_y_log10(labels = label_number()) +
labs(title = "", x = "Buffer Type", y = ylab) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 16), # Larger x-axis labels
axis.text.y = element_text(size = 10), # Log10 scaling applied
panel.grid.minor.y = element_line(color="grey", linetype="dotted"),
panel.grid.major.y = element_line(color="grey", linetype="solid")
) + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
annotation_logticks(sides="l")
} else {
p <- p + labs(title = "", x = "Buffer Type", y = ylab)
}
return(p)
}
plot_df$mean_mw <- plot_df$mean_mw/1000
# Plot for pI
plot_pI <- plot_property(plot_df, "mean_pI", ylab = "Isoelectric point (pI)")
print(plot_pI)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Plot for mw with logarithmic scale
plot_mw <- plot_property(plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE)
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
print(plot_mw)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Plot for gravy
plot_gravy <- plot_property(plot_df, "mean_gravy", ylab = "GRAVY score")
print(plot_gravy)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Combine the plots using patchwork and add a global title
combined_plot <- (plot_pI | plot_mw | plot_gravy) #+
#plot_layout(guides = "collect") &
#theme(legend.position = "bottom")
# Add overall title
#combined_plot <- combined_plot +
# plot_annotation(title = "Protein Properties by Buffer Type and Method across all Protein Groups", theme = theme(plot.title = element_text(size = 16, face = "bold")))
# Display the combined plot
print(combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Filter data for IDs that are either only in AI or AS Buffer_type, but not shared inside the method
filter_ids <- function(dt, method) {
method_dt <- dt[Method == method]
only_ai <- method_dt[Buffer_type == "AI" & !(ID %in% method_dt[Buffer_type == "AS", ID])]
only_as <- method_dt[Buffer_type == "AS" & !(ID %in% method_dt[Buffer_type == "AI", ID])]
rbind(only_ai, only_as)
}
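The symmetric-difference logic of `filter_ids()` can be illustrated on a toy `data.table` with the same column names (hypothetical IDs, not the real data):

```r
library(data.table)

toy_dt <- data.table(
  ID          = c(1, 1, 2, 3),
  Method      = "2step",
  Buffer_type = c("AI", "AS", "AI", "AS")
)
# Keep only IDs observed in exactly one buffer type within the method:
# ID 1 appears in both AI and AS and is therefore dropped
only_ai <- toy_dt[Buffer_type == "AI" & !(ID %in% toy_dt[Buffer_type == "AS", ID])]
only_as <- toy_dt[Buffer_type == "AS" & !(ID %in% toy_dt[Buffer_type == "AI", ID])]
unique_rows <- rbind(only_ai, only_as)
print(unique_rows$ID)
```

Only IDs 2 (AI-only) and 3 (AS-only) survive, which is exactly the "unique per buffer type" subset the downstream property plots are built on.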
# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
geom_split_violin() +
theme_minimal() +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
theme(axis.title.x = element_blank(), legend.position = "none") +
theme(text = element_text(size = 14))
if (log_scale) {
p <- p + scale_y_log10(labels = label_number()) +
labs(title = "", x = "Buffer Type", y = ylab) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 14), # Larger x-axis labels
axis.text.y = element_text(size = 10), # Log10 scaling applied
panel.grid.minor.y = element_line(color="grey", linetype="dotted"),
panel.grid.major.y = element_line(color="grey", linetype="solid")
) + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
annotation_logticks(sides="l")
} else {
p <- p + labs(title = "", x = "Buffer Type", y = ylab) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 14), # Larger x-axis labels
axis.text.y = element_text(size = 10)
)
}
return(p)
}
filtered_plot_df <- bind_rows(filter_ids(plot_df, "2step"), filter_ids(plot_df, "2step+M"))
# Plot for pI
plot_pI_filtered <- plot_property(filtered_plot_df, "mean_pI", ylab = "Isoelectric point (pI)")
# Plot for mw with logarithmic scale
plot_mw_filtered <- plot_property(filtered_plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE)
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
breaks <- 10^(-10:10)
minor_breaks <- rep(1:9, 21)*(10^rep(-10:10, each=9))
# Plot for gravy
plot_gravy_filtered <- plot_property(filtered_plot_df, "mean_gravy", ylab = "GRAVY score")
# Combine the plots using patchwork and add a global title
combined_plot_filtered <- (plot_pI_filtered | plot_mw_filtered | plot_gravy_filtered) +
plot_layout(guides = "collect") &
theme(legend.position = "bottom")
# Add overall title
#combined_plot_filtered <- combined_plot_filtered +
# plot_annotation(title = "Protein Properties by Buffer Type and Method across unique Protein Groups per Buffer Type", theme = theme(plot.title = #element_text(size = 16, face = "bold")))
# Display the combined plot
print(combined_plot_filtered)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
final_combined_plot <- (combined_plot / combined_plot_filtered) + plot_annotation(tag_levels = 'A') +
plot_layout(guides = "collect") & theme(legend.position = "bottom")
# plot_annotation(title = "Comparison of Protein Properties in Full and Filtered Data", theme = theme(plot.title = element_text(size = 18, face = "bold")),
# tag_levels = 'A')
ggsave(file.path(final_out_dir, "Combined_Attributes.png"), plot = final_combined_plot, width = 8, height = 6)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the final combined plot
print(final_combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Final
last_combined_plot <- ((venn1.ggplot | venn2.ggplot | violin) / (as.ggplot(upset_plot)) / (plot_pI_filtered | plot_mw_filtered | plot_gravy_filtered)) +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = "collect") &
theme(
legend.position = "bottom",
plot.tag = element_text(size = 14) # Adjust size as needed for tag consistency
)
ggsave(file.path(final_out_dir, "Combined_2step2step+.png"), plot = last_combined_plot, width = 14, height = 12, dpi = 300)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the final combined plot
print(last_combined_plot)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Calculate the number of unique IDs per method and per buffer type for filtered_plot_df
id_counts_filtered <- filtered_plot_df[, .N, by = .(Method, Buffer_type)]
setnames(id_counts_filtered, "N", "ID_Count")
# Calculate the number of unique IDs per method and per buffer type for plot_df
id_counts_full <- plot_df[, .N, by = .(Method, Buffer_type)]
setnames(id_counts_full, "N", "ID_Count")
# Create bar plot for filtered_plot_df
plot_filtered <- ggplot(id_counts_filtered, aes(x = Method, y = ID_Count, fill = Buffer_type)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
geom_text(aes(label = ID_Count), hjust = 1.1, position = position_dodge(width = 0.8)) +
scale_fill_manual(values = fraction_colors) +
labs(title = "",
x = "Method",
y = "Count of Unique IDs") +
theme_minimal() +
coord_flip()
# Create bar plot for plot_df
plot_full <- ggplot(id_counts_full, aes(x = Method, y = ID_Count, fill = Buffer_type)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
geom_text(aes(label = ID_Count), hjust = 1.1, position = position_dodge(width = 0.8)) +
scale_fill_manual(values = fraction_colors) +
labs(title = "",
x = "Method",
y = "Count of Unique IDs") +
theme_minimal() +
coord_flip()
# Display the plots
print(plot_filtered)
print(plot_full)
final_combined_plot <- ((combined_plot | plot_full) /
(combined_plot_filtered | plot_filtered)) +
plot_layout(guides = "collect") +
plot_annotation(title = "Comparison of Protein Properties in Full and Filtered Data",
theme = theme(plot.title = element_text(size = 18, face = "bold")),
tag_levels = 'A')
# Display the final combined plot
print(final_combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
dataset_2step_plus <- dataset_merged[Method == "2step+M" & !is.na(maxLfQ_intensities)]
dataset_2step_plus_repl <- dataset_2step_plus[, .(Unique_Repl_Biol_Count = uniqueN(Repl_Biol)), by = .(Method, Buffer_name, Buffer_type, Repl_Techn, ID)]
dataset_2step_plus_in_3_of_4 <- merge(
dataset_2step_plus,
dataset_2step_plus_repl[Unique_Repl_Biol_Count >= 3][, .(Method, Buffer_name, Repl_Techn, ID)],
by = c("Method", "Buffer_name", "Repl_Techn", "ID")
)
dataset_2step_plus_in_3_of_4_all_techn_per_buffer <- dataset_2step_plus_in_3_of_4[, if(uniqueN(Repl_Techn) == 4) .SD, by = .(ID, Buffer_name, Method)]
# Note: uniqueN(Repl_Techn) == 4 can be satisfied across buffers, e.g. replicates (1,2,3) in AS and (2,3,4) in AI
dataset_2step_plus_in_3_of_4_all_techn <- dataset_2step_plus_in_3_of_4[, if(uniqueN(Repl_Techn) == 4) .SD, by = .(ID, Method)]
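The caveat in the comment above can be verified on a toy table (hypothetical single-protein data): when grouping only by ID and Method, technical replicates pooled across buffers can reach four even though no single buffer covers all four.

```r
library(data.table)
# Hypothetical example: technical replicates 1-3 observed in AS, 2-4 in AI.
toy <- data.table(
  ID = 1L, Method = "2step+M",
  Buffer_name = rep(c("AS", "AI"), each = 3),
  Repl_Techn  = c(1, 2, 3, 2, 3, 4)
)
# Pooled over buffers, four distinct technical replicates are seen,
# so this ID would pass the "all four replicates" filter.
res <- toy[, .(n_techn = uniqueN(Repl_Techn)), by = .(ID, Method)]
res$n_techn  # 4
```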
stats_df_bio <- dataset_2step_plus_in_3_of_4 %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Techn) %>%
summarize(
#MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
.groups = 'drop') %>%
dplyr::rename(`Replicate` = `Repl_Techn`) %>%
mutate(`Variance_Type` = "Biological Variance")
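As a quick numeric check of the CV formula used in the summaries above (sd / mean × 100), using made-up replicate intensities:

```r
# Three hypothetical replicate intensities with mean 100 and sd 10
x <- c(90, 100, 110)
cv <- (sd(x) / mean(x)) * 100
cv  # 10
```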
stats_df_tech <- dataset_2step_plus_in_3_of_4 %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Biol) %>%
summarize(
# MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
.groups = 'drop') %>%
dplyr::rename(`Replicate` = `Repl_Biol`) %>%
mutate(`Variance_Type` = "Technical Variance")
combined_variance_stats <- bind_rows(stats_df_tech, stats_df_bio)
# Summarize the dataset
mean_variance_stats <- combined_variance_stats %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Variance_Type) %>%
summarize(
#MedianLFQ = mean(MedianLFQ, na.rm = TRUE),
MeanLFQ = mean(MeanLFQ, na.rm = TRUE),
SDLFQ = mean(SDLFQ, na.rm = TRUE),
CVLFQ = mean(CVLFQ, na.rm = TRUE)
) %>%
ungroup()
## `summarise()` has grouped output by 'ID', 'Protein.Group', 'Genes',
## 'First.Protein.Description', 'Method', 'Buffer_name'. You can override using
## the `.groups` argument.
First, we calculate summary statistics for each sample type, namely the mean LFQ intensity, its standard deviation, and the coefficient of variation. Additionally, we extract the intensities for each biological replicate.
# Calculate statistics for each sample type
stats_df_2plus <- dataset_2step_plus_in_3_of_4 %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Techn) %>%
summarize(
#MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
.groups = 'drop')# %>%
#dplyr::rename(`Replicate` = `Repl_Techn`)
# Calculate the intensities for each biological replicate
replicate_intensities <- dataset_2step_plus %>%
group_by(ID, Protein.Group, Genes, First.Protein.Description, M_value, Buffer_name, Repl_Techn, Repl_Biol) %>%
summarize(Intensity = mean(maxLfQ_intensities, na.rm = TRUE), .groups = 'drop') %>%
mutate(Tech = paste0("T", Repl_Techn)) %>%
mutate(B = paste0("B", Repl_Biol)) %>%
unite("T_B_Repl", Tech, Buffer_name, B, sep = "-") %>%
select(-Repl_Biol, -Repl_Techn) %>%
pivot_wider(names_from = T_B_Repl, values_from = Intensity)
# Add a new column M-B to stats_df using the mapping
stats_df_2plus_long <- stats_df_2plus %>%
mutate(T_B = paste0("T", Repl_Techn, "-", Buffer_name)) %>%
select(-Repl_Techn, -Buffer_name, -Method) %>%
pivot_longer(cols = c(MeanLFQ, SDLFQ, CVLFQ), names_to = "Metric", values_to = "Value") %>%
unite("T_B_Metric", T_B, Metric, sep = "_") %>%
pivot_wider(names_from = T_B_Metric, values_from = Value) #%>%
#relocate(Occurrence, .after = First.Protein.Description) # Moves Occurrence to the 5th position
# Merge with replicate intensities and arrange columns
final_df_2plus <- stats_df_2plus_long %>%
right_join(replicate_intensities, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description")) %>%
arrange(ID)
Next, we generate the column order dynamically so that, for each Method x Buffer combination, the replicate intensity columns come first, followed by the summary metrics.
# Keep only the 2step+M samples and order them by replicate and buffer
ordered_meta_copy <- ordered_meta_data %>%
#filter(Repl_Techn == 1) %>%
filter(Method == "2step+M") %>%
arrange(Repl_Techn, Buffer_name, Repl_Biol)
# Function to generate columns for each Method x Buffer combination
generate_columns <- function(buffer, t_value) {
replicate_cols <- paste0("T", t_value, "-", buffer, "-B", 1:4)
metric_cols <- paste0("T", t_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
c(replicate_cols, metric_cols)
}
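A standalone check of the naming scheme (definition repeated from the chunk above) for one hypothetical combination, buffer "AS" with technical replicate 1:

```r
generate_columns <- function(buffer, t_value) {
  replicate_cols <- paste0("T", t_value, "-", buffer, "-B", 1:4)
  metric_cols <- paste0("T", t_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
  c(replicate_cols, metric_cols)
}
# Four biological-replicate columns, then the three metric columns:
cols <- generate_columns("AS", 1)
cols
# "T1-AS-B1" "T1-AS-B2" "T1-AS-B3" "T1-AS-B4"
# "T1-AS_MeanLFQ" "T1-AS_SDLFQ" "T1-AS_CVLFQ"
```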
method_buffer_combinations <- ordered_meta_copy %>%
distinct(Repl_Techn, Buffer_name, M_value)
# Generate the column order dynamically
all_cols <- unlist(lapply(1:nrow(method_buffer_combinations), function(i) {
generate_columns(method_buffer_combinations$Buffer_name[i],
method_buffer_combinations$Repl_Techn[i])
}))
# Ensure that all_cols contains only the columns present in final_df
all_cols <- c("ID", "Protein.Group", "Genes", "First.Protein.Description", all_cols[all_cols %in% names(final_df_2plus)])
# Select columns in the specified order
final_df_2plus <- final_df_2plus %>%
select(all_of(all_cols))
final_attributes_2plus_df <- attributes_df %>% select(-Occurrence) %>%
inner_join(final_df_2plus, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description"))
write_xlsx(final_attributes_2plus_df, file.path(final_out_dir, "final_2plus_attributes_df.xlsx"))
binary_presence <- dataset_2step_plus_in_3_of_4 %>%
distinct(ID, Repl_Techn) %>%
pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn,
values_fill = list(Repl_Techn = 0),
values_fn = list(Repl_Techn = function(x) 1)) %>%
tibble::column_to_rownames(var = "ID")
binary_presence$all <- 1
binary_presence <- binary_presence[, c(4,3,2,1,5)] # reorder set columns (sort_sets = FALSE preserves this order in the plot)
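The presence-matrix construction above relies on a `pivot_wider()` trick: `values_fn` maps every observed (ID, replicate) pair to 1 and `values_fill` sets absent combinations to 0. A minimal sketch with hypothetical IDs:

```r
library(dplyr)
library(tidyr)
# P1 was seen in replicates 1 and 2, P2 only in replicate 2
toy <- tibble(ID = c("P1", "P1", "P2"), Repl_Techn = c(1, 2, 2))
wide <- toy %>%
  pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn,
              values_fill = list(Repl_Techn = 0),
              values_fn = list(Repl_Techn = function(x) 1))
wide
# ID    `1`   `2`
# P1      1     1
# P2      0     1
```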
upset_plot <- ComplexUpset::upset(data=binary_presence,
intersect=names(binary_presence), name="",
wrap=TRUE,
themes=upset_modify_themes(list('intersections_matrix'=theme(axis.text.y=element_text(size=12)))),
base_annotations=list(
'No. of\nintersecting\nproteins'=intersection_size(counts=TRUE, bar_number_threshold = 1) +
expand_limits(y = 4000) +
theme(
#panel.grid.major = element_blank(), # Remove major gridlines
panel.grid.minor = element_blank(), # Remove minor gridlines
axis.text.y = element_text(size = 11), # Increase y-axis annotation size
axis.title.y = element_text(size = 13))), # Increase y-axis title size),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count', size = 3.5)
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=4)
+ expand_limits(y=7000) +
theme(axis.title.x = element_text(size = 13), axis.text.x = element_text(size = 10))
),
sort_sets=FALSE,
queries=list(
upset_query(
intersect = names(binary_presence),
color='darkred',
fill='darkred',
only_components=c('intersections_matrix', 'No. of\nintersecting\nproteins')
),
upset_query(set = c("all"), fill = "orange")
),
sort_intersections_by='ratio') + theme(text = element_text(size = 14), axis.title.x = element_text(size = 12))
ggsave(file.path(final_out_dir, "Technical_Repl_Intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
print(upset_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
binary_presence <- dataset_2step_plus_in_3_of_4 %>%
filter(Buffer_name == "AS") %>%
distinct(ID, Repl_Techn) %>%
pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn,
values_fill = list(Repl_Techn = 0),
values_fn = list(Repl_Techn = function(x) 1)) %>%
tibble::column_to_rownames(var = "ID")
binary_presence$all <- 1
binary_presence <- binary_presence[, c(4,3,2,1,5)]
upset_plotAS <- ComplexUpset::upset(data=binary_presence,
intersect=names(binary_presence), name="Technical Replicates",
wrap=TRUE,
base_annotations=list(
'Intersection size'=intersection_size(
text=list(size = 2.5))),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count')
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
+ expand_limits(y=7000)
),
sort_sets=FALSE,
queries=list(
upset_query(
intersect = names(binary_presence),
color='palegreen3',
fill='palegreen3',
only_components=c('intersections_matrix', 'Intersection size')
),
upset_query(set = c("all"), fill = "orange")
),
sort_intersections_by='ratio') + ggtitle('Intersection of Quantified Proteins in 2Step+ in AS')
#ggsave(file.path(final_out_dir, "technical_repl_intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')
print(upset_plotAS)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
binary_presence <- dataset_2step_plus_in_3_of_4 %>%
filter(Buffer_name == "AI") %>%
distinct(ID, Repl_Techn) %>%
pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn,
values_fill = list(Repl_Techn = 0),
values_fn = list(Repl_Techn = function(x) 1)) %>%
tibble::column_to_rownames(var = "ID")
binary_presence$all <- 1
binary_presence <- binary_presence[, c(4,3,2,1,5)]
upset_plotAI <- ComplexUpset::upset(data=binary_presence,
intersect=names(binary_presence), name="Technical Replicates",
wrap=TRUE,
base_annotations=list(
'Intersection size'=intersection_size(
text=list(size = 2.5))),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count')
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
+ expand_limits(y=7000)
),
sort_sets=FALSE,
queries=list(
upset_query(
intersect = names(binary_presence),
color='royalblue',
fill='royalblue',
only_components=c('intersections_matrix', 'Intersection size')
),
upset_query(set = c("all"), fill = "orange")
),
sort_intersections_by='ratio') + ggtitle('Intersection of Quantified Proteins in 2Step+ in AI')
#ggsave(file.path(final_out_dir, "technical_repl_intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')
print(upset_plotAI)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
(upset_plot) / (upset_plotAS | upset_plotAI ) + plot_annotation(tag_levels = 'A')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Define a function to create and save plots
create_and_save_plot <- function(df, y_var, log_transform = FALSE, sub_dir, filename) {
p <- ggplot(df, aes(x = as.factor(Replicate), y = if (log_transform) log10(!!sym(y_var)) else !!sym(y_var), fill = Buffer_name)) +
geom_boxplot() +
facet_grid( Buffer_name ~ Variance_Type, scales = "free_x") +
labs(title = paste("Biological and Technical Variance in", y_var),
x = "Replicate", y = if (log_transform) paste("Log10(", y_var, ")", sep = "") else y_var) +
scale_fill_manual(values = c("AI" = "royalblue", "AS" = "palegreen3")) +
theme_minimal()
# Print the plot
print(p)
# Save the plot
ggsave(file.path(sub_dir, filename), plot = p, width = 12, height = 8, bg = 'white')
return(p)
}
# Plot for MeanLFQ with log10 transformation
meanlfq <- create_and_save_plot(combined_variance_stats, "MeanLFQ", log_transform = TRUE, final_out_dir, "Variance_MeanLFQ_Boxplot.png")
# Plot for SDLFQ with log10 transformation
sdlfq <- create_and_save_plot(combined_variance_stats, "SDLFQ", log_transform = TRUE, final_out_dir, "Variance_SDLFQ_Boxplot.png")
## Warning: Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Plot for CVLFQ without log10 transformation
cvlfq <- create_and_save_plot(combined_variance_stats, "CVLFQ", log_transform = FALSE, final_out_dir, "Variance_CVLFQ_Boxplot.png")
## Warning: Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
library(ggpubr)
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:VennDiagram':
##
## rotate
## The following object is masked from 'package:cowplot':
##
## get_legend
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:IRanges':
##
## desc
## The following object is masked from 'package:stats':
##
## filter
# Define a function to create and save summary plots
create_and_save_summary_plot <- function(df, y_var, log_transform = FALSE, sub_dir, filename) {
# Modify the variance type for visualization
df_mut <- df %>% mutate(Variance_Type = str_replace_all(Variance_Type, " ", "\n"))
# Perform the t-test for each Buffer_name group
stat.test <- df_mut %>%
group_by(Buffer_name) %>%
t_test(CVLFQ ~ Variance_Type) %>%
add_significance() # %>%
# mutate(
# effsize = round(effsize, 2), # Round effect size to 2 decimal places
# label = paste(effsize, " (", magnitude, ")", sep = "") # Format label with magnitude
# )
# Create the plot
p <- ggpubr::ggboxplot(df_mut, x = "Variance_Type", y = y_var, fill = "Buffer_name",
palette = c("AI" = "royalblue", "AS" = "palegreen3")) +
facet_wrap(~ Buffer_name) +
labs(title = "", fill = "Buffer Type",
x = "", y = if (log_transform) paste("Log10(", y_var, ")", sep = "") else "Coefficient of Variation in % (maxLFQ)") +
theme_minimal() +
theme(text = element_text(size = 14)) +
scale_x_discrete(labels = named_labels)
# Add t-test results as text annotations
p <- p + ggpubr::stat_pvalue_manual(stat.test, label = "{p.signif}",
y.position = max(df$CVLFQ, na.rm = TRUE) * 1.05,
size = 4, color = "black")
# Print the plot
print(p)
# Save the plot
ggsave(file.path(sub_dir, filename), plot = p, width = 9, height = 6, bg = 'white')
return(p)
}
# Generate the summary plot for CVLFQ without log10 transformation
summary_cvlfq <- create_and_save_summary_plot(mean_variance_stats, "CVLFQ", log_transform = FALSE, final_out_dir, "Summary_CVLFQ_Boxplot.png")
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
# Combining the plots
combined_plot_2stepplus <- (summary_cvlfq | upset_plot) + plot_annotation(tag_levels = 'A') +
plot_layout(widths = c(1, 2), guides = "collect") & theme(legend.position = "bottom")
# Save the combined figure
ggsave(file.path(final_out_dir, "2stepplus.png"), plot = combined_plot_2stepplus, width = 14, height = 6, dpi = 200)
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
print(combined_plot_2stepplus)
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
dataset_filtered_in_3_of_4_col <- dataset_filtered_in_3_of_4 %>%
filter(str_detect(Genes, "Col\\d"))
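The regular expression `"Col\\d"` matches a literal "Col" followed by a digit, which captures collagen gene symbols such as Col1a1. Note that the match is case-sensitive, so this filter assumes mouse-style symbols (Col1a1) rather than human-style all-caps symbols (COL1A1):

```r
library(stringr)
# "Colq" lacks a digit after "Col"; "COL1A1" fails the case-sensitive match
hits <- str_detect(c("Col1a1", "Col3a1", "Colq", "COL1A1"), "Col\\d")
hits  # TRUE TRUE FALSE FALSE
```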
library(dplyr)
library(ggplot2)
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4_col
mean_intensities_col <- dataset_filtered_in_3_of_4_col %>%
group_by(Method, Buffer_name, ID) %>%
summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name'. You can override
## using the `.groups` argument.
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4
mean_intensities_all <- dataset_filtered_in_3_of_4 %>%
group_by(Method, Buffer_name, ID) %>%
summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name'. You can override
## using the `.groups` argument.
# Count the total number of IDs per Method x Buffer_name
total_ids_all <- mean_intensities_all %>%
group_by(Method, Buffer_name) %>%
summarize(total_ids = n())
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Count the total number of IDs in _col per Method x Buffer_name
total_ids_col <- mean_intensities_col %>%
group_by(Method, Buffer_name) %>%
summarize(total_ids_col = n())
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Merge the counts
total_ids <- total_ids_all %>%
left_join(total_ids_col, by = c("Method", "Buffer_name")) %>%
replace_na(list(total_ids_col = 0))
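The `replace_na()` step matters because `left_join()` leaves NA counts for any Method x Buffer combination that has no collagen IDs; those should be explicit zeros for plotting. A toy illustration with hypothetical counts:

```r
library(dplyr)
library(tidyr)
# Method "B" had no matching collagen rows after the join
out <- tibble(Method = c("A", "B"), total_ids_col = c(3, NA)) %>%
  replace_na(list(total_ids_col = 0))
out$total_ids_col  # 3 0
```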
# Barplot
ggplot(total_ids, aes(x = Buffer_name, y = total_ids, fill = "All IDs")) +
geom_bar(stat = "identity", position = "dodge") +
geom_bar(aes(y = total_ids_col, fill = "Collagen IDs"), stat = "identity", position = "dodge") +
geom_text(aes(y = total_ids, label = total_ids), vjust = -0.5, position = position_dodge(width = 0.9)) +
geom_text(aes(y = total_ids_col, label = total_ids_col), vjust = -0.5, position = position_dodge(width = 0.9), color = "white") +
labs(title = "Total Number of IDs per Method x Buffer_name",
x = "Method x Buffer_name",
y = "Number of IDs") +
scale_fill_manual(name = "ID Type", values = c("All IDs" = "blue", "Collagen IDs" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10),
panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
facet_grid(~ Method, scales = "free", space = "free") +
ylim(0,5000)
# Add a column to indicate the dataset source
mean_intensities_all <- mean_intensities_all %>% mutate(Source = "All")
mean_intensities_col <- mean_intensities_col %>% mutate(Source = "Collagen")
# Combine the datasets
combined_intensities <- bind_rows(mean_intensities_all, mean_intensities_col)
# Boxplot
ggplot(combined_intensities, aes(x = Buffer_name, y = log10(mean_intensity), fill = Source)) +
geom_boxplot() +
labs(title = "Intensities for All and Collagen IDs per Method x Buffer_name",
x = "Method x Buffer_name",
y = "Intensity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10),
panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
scale_fill_manual(name = "ID Type", values = c("All" = "blue", "Collagen" = "red")) +
facet_grid(~ Method, scales = "free", space = "free")
# Sum of intensities for all IDs
sum_intensities_all <- mean_intensities_all %>%
group_by(Method, Buffer_name) %>%
summarize(sum_intensity_all = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Sum of intensities for collagen IDs
sum_intensities_col <- mean_intensities_col %>%
group_by(Method, Buffer_name) %>%
summarize(sum_intensity_col = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Merge the sums
sum_intensities <- sum_intensities_all %>%
left_join(sum_intensities_col, by = c("Method", "Buffer_name")) %>%
replace_na(list(sum_intensity_col = 0))
# Calculate the percentage of intensities in _col
sum_intensities <- sum_intensities %>%
mutate(percentage_col = (sum_intensity_col / sum_intensity_all) * 100)
# Barplot for sum of intensities
ggplot(sum_intensities, aes(x = Buffer_name, y = sum_intensity_all, fill = "All Intensities")) +
geom_bar(stat = "identity", position = "dodge") +
geom_bar(aes(y = sum_intensity_col, fill = "Collagen Intensities"), stat = "identity", position = "dodge") +
labs(title = "Sum of Intensities per Method x Buffer_name",
x = "Method x Buffer_name",
y = "Sum of Intensities") +
scale_fill_manual(name = "Intensity Type", values = c("All Intensities" = "blue", "Collagen Intensities" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10),
panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
facet_grid(~ Method, scales = "free", space = "free")
# Print the percentages
print(sum_intensities)
## # A tibble: 10 × 5
## # Groups: Method [5]
## Method Buffer_name sum_intensity_all sum_intensity_col percentage_col
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 1step single 310178429. 23301675. 7.51
## 2 1step+C single 141478011. 4264569. 3.01
## 3 2step AS 202308694. 9666053. 4.78
## 4 2step AI 165454203. 11862848. 7.17
## 5 2step+M AS 221498463. 21260730. 9.60
## 6 2step+M AI 176002435. 13063953. 7.42
## 7 4step AS1 211851742. 9326736. 4.40
## 8 4step AI1 175606308. 10654546. 6.07
## 9 4step AI2 153282257. 20497227. 13.4
## 10 4step AS2 254838181. 144248511. 56.6
# Sum of intensities for all IDs
sum_intensities_all <- mean_intensities_all %>%
group_by(Method, Buffer_name) %>%
summarize(sum_intensity_all = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Sum of intensities for collagen IDs
sum_intensities_col <- mean_intensities_col %>%
group_by(Method, Buffer_name) %>%
summarize(sum_intensity_col = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Merge the sums
sum_intensities <- sum_intensities_all %>%
left_join(sum_intensities_col, by = c("Method", "Buffer_name")) %>%
replace_na(list(sum_intensity_col = 0))
# Calculate the percentage of intensities in _col
sum_intensities <- sum_intensities %>%
mutate(percentage_col = (sum_intensity_col / sum_intensity_all) * 100,
percentage_all = 100)
# Barplot for sum of intensities
p <- ggplot(sum_intensities, aes(x = Buffer_name, y = percentage_all, fill = "All Intensities")) +
geom_bar(stat = "identity", position = "dodge") +
geom_bar(aes(y = percentage_col, fill = "Collagen Intensities"), stat = "identity", position = "dodge") +
geom_text(aes(y = percentage_col, label = sprintf("%.2f%%", percentage_col)), vjust = -0.5, position = position_dodge(width = 0.9), color = "white") +
labs(title = "Percentage of Intensities per Method x Buffer_name",
x = "Method x Buffer_name",
y = "Percentage of Intensities") +
scale_fill_manual(name = "Intensity Type", values = c("All Intensities" = "blue", "Collagen Intensities" = "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10),
panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
facet_grid(~ Method, scales = "free", space = "free")
# Print the percentages
print(sum_intensities)
## # A tibble: 10 × 6
## # Groups: Method [5]
## Method Buffer_name sum_intensity_all sum_intensity_col percentage_col
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 1step single 310178429. 23301675. 7.51
## 2 1step+C single 141478011. 4264569. 3.01
## 3 2step AS 202308694. 9666053. 4.78
## 4 2step AI 165454203. 11862848. 7.17
## 5 2step+M AS 221498463. 21260730. 9.60
## 6 2step+M AI 176002435. 13063953. 7.42
## 7 4step AS1 211851742. 9326736. 4.40
## 8 4step AI1 175606308. 10654546. 6.07
## 9 4step AI2 153282257. 20497227. 13.4
## 10 4step AS2 254838181. 144248511. 56.6
## # ℹ 1 more variable: percentage_all <dbl>
print(p)
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4_col
mean_intensities_col <- dataset_filtered_in_3_of_4_col %>%
group_by(Method, Buffer_name, ID, Genes) %>%
summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name', 'ID'. You can
## override using the `.groups` argument.
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4
mean_intensities_all <- dataset_filtered_in_3_of_4 %>%
group_by(Method, Buffer_name, ID, Genes) %>%
summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name', 'ID'. You can
## override using the `.groups` argument.
# Rank the data by mean_intensity for each Method x Buffer_name combination
ranked_data <- mean_intensities_all %>%
group_by(Method, Buffer_name) %>%
mutate(rank = rank(-mean_intensity, ties.method = "first")) %>%
ungroup()
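The ranking above uses `rank(-mean_intensity, ...)` so that rank 1 goes to the most intense protein; `ties.method = "first"` resolves equal intensities by their original order, keeping ranks unique. A quick check with made-up values:

```r
# Two tied values (5) receive consecutive ranks in order of appearance
x <- c(5, 9, 5, 1)
r <- rank(-x, ties.method = "first")
r  # 2 1 3 4
```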
ranked_data <- ranked_data %>%
mutate(Buffer_name = factor(Buffer_name, levels = c("single", "AS", "AI", "AS1", "AI1", "AI2", "AS2"))) %>%
mutate(Buffer_name = recode(Buffer_name, "single" = "Single"))
# Create the plot
p <- ggplot(ranked_data, aes(x = rank, y = log10(mean_intensity), color = Buffer_name)) +
geom_point(size = 0.5) +
labs(title = "Log10 Mean Intensity vs Rank by Method",
x = "Rank",
y = "Log10 Mean Intensity",
color = "Buffer Name") +
facet_wrap(~ Method, ncol = 2) + #, scales = "free_x") +
theme_bw() +
theme(strip.text = element_text(size = 10),
panel.spacing = unit(0.5, "lines"),
panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
guides(color = guide_legend(override.aes = list(size = 5)))
ggsave(file.path(final_out_dir, "Supplement1.png"), plot = p, width = 8, height = 4)
print(p)
library(purrr)
ranked_data_col <- ranked_data %>% filter(str_detect(Genes, "Col\\d"))
# Create the table, sorted by Genes
ranked_table <- ranked_data_col %>%
group_by(Method, Buffer_name) %>%
arrange(rank, .by_group = TRUE) %>%
select(Method, Buffer_name, Rank = rank, ID, Genes, mean_intensity) %>%
ungroup()
# Split the dataframe into a list of dataframes, one for each Method_Buffer_name combination
ranked_table_list <- ranked_table %>%
split(list(.$Method, .$Buffer_name))
# Save each sub-dataframe as a CSV file
purrr::walk2(names(ranked_table_list), ranked_table_list, function(name, data) {
if(nrow(data) > 0) { # Check if the dataframe is not empty
filename <- paste0(gsub(" ", "_", name), ".csv")
write.csv(data, file.path(final_out_dir, filename), row.names = FALSE)
}
})
# Function to get the rows with the highest and lowest mean intensity for each group
get_extreme_rows <- function(df) {
df %>%
filter(mean_intensity == max(mean_intensity) | mean_intensity == min(mean_intensity))
}
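A toy check of `get_extreme_rows()` (definition repeated for a standalone run, with hypothetical intensities): only the rows holding the maximum and the minimum mean intensity survive the filter.

```r
library(dplyr)
get_extreme_rows <- function(df) {
  df %>%
    filter(mean_intensity == max(mean_intensity) | mean_intensity == min(mean_intensity))
}
toy <- data.frame(ID = 1:3, mean_intensity = c(10, 50, 30))
extremes <- get_extreme_rows(toy)
extremes$mean_intensity  # 10 50
```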
# Apply the function to each Method x Buffer_name group
filtered_data <- ranked_data %>%
group_by(Method, Buffer_name) %>%
do(get_extreme_rows(.))
# Ungroup the data for cleaner output
filtered_data <- filtered_data %>% ungroup()
filtered_data$log10 <- log10(filtered_data$mean_intensity)
# Plot the data
ggplot(ranked_data, aes(x = rank, y = log10(mean_intensity), color = Buffer_name)) +
geom_point() +
labs(title = "Log10 Mean Intensity vs Rank by Method",
x = "Rank",
y = "Log10 Mean Intensity",
color = "Buffer Name") +
facet_wrap(~ Method, ncol = 2) +
theme_minimal() +
theme(strip.text = element_text(size = 10),
panel.spacing = unit(0.5, "lines"),
panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
library(ggpointdensity)
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
# Function to create scatterplot with correlation
create_scatterplot_with_correlation <- function(df, method_label, log_scale = FALSE) {
# Step 1: Filter the dataset for the specified method
filtered_df <- df %>%
filter(Method == method_label)
# Step 2: Identify shared IDs
shared_ids <- filtered_df %>%
group_by(ID) %>%
filter(n_distinct(Buffer_name) > 1) %>%
pull(ID) %>%
unique()
# Step 3: Subset the data for shared IDs
shared_data <- filtered_df %>%
filter(ID %in% shared_ids)
# Step 4: Reshape the data to have one row per ID with columns for each buffer
shared_data_wide <- shared_data %>%
select(-SDLFQ, -CVLFQ) %>%
pivot_wider(names_from = Buffer_name, values_from = MeanLFQ, names_prefix = "MeanLFQ_")
# Step 5: Apply log scale if flag is set
if (log_scale) {
shared_data_wide <- shared_data_wide %>%
mutate(across(starts_with("MeanLFQ_"), log2))
}
# Step 6: Calculate the correlation
correlation <- cor(shared_data_wide$MeanLFQ_AS, shared_data_wide$MeanLFQ_AI, use = "complete.obs")
# Step 7: Create the scatterplot
p <- ggplot(shared_data_wide, aes(x = MeanLFQ_AS, y = MeanLFQ_AI)) +
#geom_point(color = "#0072B2") +
geom_pointdensity(size = 0.5) + scale_color_viridis(name = "Density") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(title = paste("Scatterplot of Shared IDs between\nAS and AI Buffers in", method_label, "Method"),
x = if (log_scale) "Log2 MeanLFQ Intensity in AS Buffer" else "MeanLFQ Intensity in AS Buffer",
y = if (log_scale) "Log2 MeanLFQ Intensity in AI Buffer" else "MeanLFQ Intensity in AI Buffer") +
theme_minimal(base_size = 14) + # Adjust base size for smaller text
coord_fixed(xlim = c(10, 25), ylim = c(10, 25)) + # Fixed x and y axis limits
scale_x_continuous(breaks = seq(10, 25, by = 5)) +
scale_y_continuous(breaks = seq(10, 25, by = 5)) +
annotate("text", x = Inf, y = -Inf, label = paste("Corr.:", round(correlation, 2)),
hjust = 1.1, vjust = -1.5, size = 4, color = "black")
print(p)
return(p)
}
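The `use = "complete.obs"` argument makes `cor()` silently drop any pair with a missing value before computing the Pearson correlation; a minimal illustration on toy vectors:

```r
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, NA)
# Only the three complete pairs (1,2), (2,4), (3,6) enter the calculation;
# they are perfectly collinear, so the correlation is exactly 1
cor(x, y, use = "complete.obs")
```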
# Create the scatterplot for "2step"; the "2step+M" call is kept below for reference
create_scatterplot_with_correlation(stats_df, "2step", log_scale = TRUE)
#create_scatterplot_with_correlation(stats_df, "2step+M", log_scale = TRUE)
Technical variance for 2step+M
library(ggpointdensity)
library(viridis)
library(patchwork) # To arrange plots
# Function to create scatterplot with correlation for multiple replicates
create_scatterplot_with_correlation_multiple_replicates <- function(df, method_label, log_scale = FALSE) {
# Step 1: Filter the dataset for the specified method
filtered_df <- df %>%
filter(Method == method_label)
filtered_df$Replicate <- factor(filtered_df$Replicate, levels = c(1, 2, 3, 4))
# Step 2: Identify shared IDs for each replicate
shared_ids <- filtered_df %>%
group_by(ID) %>%
filter(n_distinct(Buffer_name) > 1) %>%
pull(ID) %>%
unique()
# Step 3: Subset the data for shared IDs
shared_data <- filtered_df %>%
filter(ID %in% shared_ids)
# Step 4: Reshape the data to have one row per ID with columns for each buffer
shared_data_wide <- shared_data %>%
select(-SDLFQ, -CVLFQ) %>%
pivot_wider(names_from = Buffer_name, values_from = MeanLFQ, names_prefix = "MeanLFQ_")
# Step 5: Apply log scale if flag is set
if (log_scale) {
shared_data_wide <- shared_data_wide %>%
mutate(across(starts_with("MeanLFQ_"), log2))
}
# Step 6: Calculate correlation for each replicate and add as a new column
correlation_df <- shared_data_wide %>%
group_by(Replicate) %>%
summarize(corr_value = round(cor(MeanLFQ_AS, MeanLFQ_AI, use = "complete.obs"), 2))
# Merge correlation data back to the main data frame for annotation
shared_data_wide <- left_join(shared_data_wide, correlation_df, by = "Replicate")
# Step 7: Create a scatterplot with facets for each replicate
p <- ggplot(shared_data_wide, aes(x = MeanLFQ_AS, y = MeanLFQ_AI)) +
geom_pointdensity(size = 0.3) +
scale_color_viridis(name = "Density") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(title = paste("Scatterplot of Shared IDs between\nAS and AI Buffers in", method_label, "Method"),
x = if (log_scale) "Log2 MeanLFQ Intensity in AS Buffer" else "MeanLFQ Intensity in AS Buffer",
y = if (log_scale) "Log2 MeanLFQ Intensity in AI Buffer" else "MeanLFQ Intensity in AI Buffer") +
theme_minimal(base_size = 14) + # Adjust base size for smaller text
coord_fixed(xlim = c(10, 25), ylim = c(10, 25)) + # Fixed x and y axis limits
scale_x_continuous(breaks = seq(10, 25, by = 5)) +
scale_y_continuous(breaks = seq(10, 25, by = 5)) +
facet_wrap(~ Replicate, ncol = 2, labeller = labeller(Replicate = function(x) paste("Technical Replicate", x))) +
geom_text(data = correlation_df, aes(x = 25, y = 11, label = paste("Corr.:", corr_value)),
size = 4, color = "black", hjust = 1, vjust = 1) # Position text at the bottom right
print(p)
return(p)
}
# Create scatterplots for "2step" from stats_df
scatter_2step <- create_scatterplot_with_correlation(stats_df, "2step", log_scale = TRUE)
# Create scatterplots for "2step+M" from stats_df_bio, separated by replicates
scatter_2stepplus <- create_scatterplot_with_correlation_multiple_replicates(stats_df_bio, "2step+M", log_scale = TRUE)
## Warning: Removed 1570 rows containing missing values or values outside the scale range
## (`stat_pointdensity()`).
# Combine scatter_2step and scatter_2stepplus into one plot with tags A and B
combined_plot <- (scatter_2step | scatter_2stepplus) +
#plot_layout(widths = c(0.49, 0.51)) +
plot_annotation(tag_levels = 'A')
# Save the combined figure
ggsave(file.path(final_out_dir, "Supplement2.png"), plot = combined_plot, width = 12, height = 9)
## Warning: Removed 1570 rows containing missing values or values outside the scale range
## (`stat_pointdensity()`).
print(combined_plot)
## Warning: Removed 1570 rows containing missing values or values outside the scale range
## (`stat_pointdensity()`).
Mean per Protein.Group
library(scales)
plot_df <- merged_attributes_df_repl[merged_attributes_df_repl$Method %in% c("2step", "2step+M")]
# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
geom_split_violin() +
theme_minimal() +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
theme(axis.title.x = element_blank(), legend.position = "none") +
theme(text = element_text(size = 14),axis.text.x = element_text(size = 14))
if (log_scale) {
# Apply the log10 scale once ('breaks' and 'minor_breaks' are defined earlier in the document)
p <- p + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
labs(title = "", x = "Buffer Type", y = ylab) +
theme(
text = element_text(size = 14),
axis.text.x = element_text(size = 14), # Larger x-axis labels
axis.text.y = element_text(size = 10), # Smaller y-axis labels on the log scale
panel.grid.minor.y = element_line(color="grey", linetype="dotted"),
panel.grid.major.y = element_line(color="grey", linetype="solid")
) +
annotation_logticks(sides="l")
} else {
p <- p + labs(title = "", x = "Buffer Type", y = ylab)
}
return(p)
}
# Convert molecular weight from Da to kDa
plot_df$mean_mw <- plot_df$mean_mw/1000
# Plot for pI
plot_pI <- plot_property(plot_df, "mean_pI", ylab = "Isoelectric point (pI)")
# Plot for mw with logarithmic scale
plot_mw <- plot_property(plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE)
# Plot for gravy
plot_gravy <- plot_property(plot_df, "mean_gravy", ylab = "GRAVY score")
# Combine the plots using patchwork and add a global title
combined_plot <- (plot_pI | plot_mw | plot_gravy) +
plot_annotation(tag_levels = 'A') +
plot_layout(guides = "collect") &
theme(
legend.position = "bottom",
plot.tag = element_text(size = 14) # Adjust size as needed for tag consistency
)
ggsave(file.path(final_out_dir, "Supplement3.png"), plot = combined_plot, width = 8, height = 4)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the combined plot
print(combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Create binary presence/absence matrix per method + buffer
binary_presence_method <- dataset_filtered_in_3_of_4 %>%
mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
distinct(ID, Method_Buffer) %>%
pivot_wider(
names_from = Method_Buffer,
values_from = Method_Buffer,
values_fill = list(Method_Buffer = 0),
values_fn = list(Method_Buffer = function(x) 1)
) %>%
tibble::column_to_rownames(var = "ID")
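The `distinct()` + `pivot_wider()` idiom above turns a long ID-by-group table into a 0/1 membership matrix; a toy sketch with hypothetical IDs and group labels:

```r
library(dplyr)
library(tidyr)

toy <- tibble::tibble(
  ID    = c(1, 1, 2, 3, 3),
  Group = c("A", "B", "A", "B", "C")
)

# Any observed (ID, Group) combination becomes 1; missing combinations are filled with 0
toy_wide <- toy %>%
  distinct(ID, Group) %>%
  pivot_wider(
    names_from  = Group,
    values_from = Group,
    values_fill = list(Group = 0),
    values_fn   = list(Group = function(x) 1)
  )
# ID 1 is present in A and B but not C; ID 2 only in A; ID 3 in B and C
```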
# Add an 'all' column to highlight full overlaps
binary_presence_method$all <- 1
# Optional: Reorder columns (manually or automatically)
# Example: binary_presence_method <- binary_presence_method[, c(5,4,3,2,1,6)]
binary_presence_method <- binary_presence_method[, c(setdiff(names(binary_presence_method), "all"), "all")]
# Identify method+buffer columns (excluding 'all')
method_columns <- setdiff(names(binary_presence_method), "all")
# Query: full intersection of all sets (including "all")
full_intersection_query <- ComplexUpset::upset_query(
intersect = names(binary_presence_method),
color = 'orange',
fill = 'orange',
only_components = c('intersections_matrix')
)
# Queries: exact 2-way intersections ("all" + one method_buffer)
pairwise_queries <- lapply(method_columns, function(method) {
ComplexUpset::upset_query(
intersect = c("all", method),
color = "steelblue",
fill = "steelblue",
only_components = c('intersections_matrix', 'No. of\nintersecting\nproteins')
)
})
# Query: highlight just the "all" set bar (not intersections)
all_set_query <- ComplexUpset::upset_query(
set = "all",
fill = "orange"
)
# Combine all queries (highlighting disabled for this plot; restore the commented call to re-enable)
query_list <- c() #c(list(full_intersection_query), pairwise_queries, list(all_set_query))
# Generate the plot
upset_plot_method <- ComplexUpset::upset(
data = binary_presence_method,
intersect = names(binary_presence_method),
name = "",
wrap = TRUE,
themes = upset_modify_themes(list(
'intersections_matrix' = theme()
)),
base_annotations = list(
'No. of\nintersecting\nproteins' = intersection_size(
bar_number_threshold = 1,
text=list(size = 3,
vjust=-0.2,
hjust=-0.2,
angle=45
)) +
expand_limits(y = 600)
),
set_sizes = (
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count')#, size = 4)
+ annotate(geom = 'text', label = '@', x = 'Count', y = 850, color = 'white')#, size = 10)
+ expand_limits(y = 7000)
+ theme()#axis.title.x = element_text(size = 13), axis.text.x = element_text(size = 10))
),
sort_sets = FALSE,
queries = query_list,
sort_intersections_by = 'ratio'
) +
theme()#text = element_text(size = 13), axis.title.x = element_text(size = 13))
# Show the plot
print(upset_plot_method)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Optionally save the plot
ggsave(file.path(final_out_dir, "Upset_Method_Buffer_34.png"),
plot = upset_plot_method, width = 30, height = 7, bg = 'white')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Create binary presence/absence matrix per method
binary_presence_method <- dataset_filtered %>%
distinct(ID, Method) %>%
pivot_wider(names_from = Method, values_from = Method,
values_fill = list(Method = 0),
values_fn = list(Method = function(x) 1)) %>%
tibble::column_to_rownames(var = "ID")
# Add an 'all' column to highlight full overlaps
binary_presence_method$all <- 1
# Optional: Reorder columns (if specific order is desired)
#binary_presence_method <- binary_presence_method[, c(setdiff(names(binary_presence_method), "all"), "all")]
binary_presence_method <- binary_presence_method[, c(5,2,1,4,3,6)]
# Identify method columns (excluding 'all')
method_columns <- setdiff(names(binary_presence_method), "all")
# Query: full intersection of all sets (including "all")
full_intersection_query <- ComplexUpset::upset_query(
intersect = names(binary_presence_method),
color = 'orange',
fill = 'orange',
only_components = c('intersections_matrix')
)
# Queries: exact 2-way intersections ("all" + one method)
pairwise_queries <- lapply(method_columns, function(method) {
ComplexUpset::upset_query(
intersect = c("all", method),
color = "steelblue",
fill = "steelblue",
only_components = c('intersections_matrix', 'No. of\nintersecting\nproteins')
)
})
# Query: highlight just the "all" set bar (not intersections)
all_set_query <- ComplexUpset::upset_query(
set = "all",
fill = "orange"
)
# Combine all queries
query_list <- c(list(full_intersection_query), pairwise_queries, list(all_set_query))
# Generate the plot
upset_plot_method <- ComplexUpset::upset(
data = binary_presence_method,
intersect = names(binary_presence_method),
name = "",
wrap = TRUE,
themes = upset_modify_themes(list(
'intersections_matrix' = theme(axis.text.y = element_text(size = 13))
)),
base_annotations = list(
'No. of\nintersecting\nproteins' = intersection_size(
bar_number_threshold = 1,
text=list(
vjust=-0.1,
hjust=0.1,
angle=45
)) +
expand_limits(y = 2500) +
theme(
panel.grid.minor = element_blank(),
axis.text.y = element_text(size = 13),
axis.title.y = element_text(size = 13)
)
),
set_sizes = (
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count', size = 4)
+ annotate(geom = 'text', label = '@', x = 'Count', y = 850, color = 'white', size = 10)
+ expand_limits(y = 7000)
+ theme(axis.title.x = element_text(size = 13), axis.text.x = element_text(size = 10))
),
sort_sets = FALSE,
queries = query_list,
sort_intersections_by = 'ratio'
) + theme(text = element_text(size = 13), axis.title.x = element_text(size = 13))
# Save the plot
ggsave(file.path(final_out_dir, "Upset_Method_All.png"), plot = upset_plot_method, width = 10, height = 6, bg = 'white')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
# Show the plot
print(upset_plot_method)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#BiocManager::install("clusterProfiler")
#BiocManager::install("org.Rn.eg.db") # Rat gene database
library(dplyr)
library(clusterProfiler)
##
## clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
##
## Please cite:
##
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang,
## W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
## multiomics data. Nature Protocols. 2024, 19(11):3292-3320
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:XVector':
##
## slice
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:purrr':
##
## simplify
## The following object is masked from 'package:stats':
##
## filter
library(org.Rn.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
##
## select
## The following object is masked from 'package:rstatix':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
##
# Step 1: Split by method
list_of_dataframes <- split(dataset_filtered_in_3_of_4, dataset_filtered_in_3_of_4$Method)
# Step 2: Find all IDs and their method counts
id_method_counts <- dataset_filtered_in_3_of_4[, .(Method, ID)] %>% distinct()
id_counts <- as.data.frame(table(id_method_counts$ID))
# Keep only IDs that appear in ONE method
exclusive_ids <- id_counts$Var1[id_counts$Freq == 1] # change if you want to include all
# Step 3: Function to extract exclusive genes per method
process_exclusive_genes <- function(df, method_name) {
df_exclusive <- df[df$ID %in% exclusive_ids, ]
genes_split <- strsplit(as.character(df_exclusive$Genes), ";")
unique_genes <- unique(unlist(genes_split))
return(unique_genes)
}
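Entries in `Genes` can contain several ";"-separated symbols; the `strsplit()`/`unlist()`/`unique()` chain flattens them into one deduplicated vector. On toy entries (hypothetical gene symbols):

```r
genes <- c("Alb;Apoa1", "Col1a1", "Alb")  # toy multi-gene and single-gene entries
unique(unlist(strsplit(genes, ";")))
# "Alb" "Apoa1" "Col1a1"
```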
# Step 4: Apply to each method's dataframe
list_of_gene_vectors <- mapply(
FUN = process_exclusive_genes,
df = list_of_dataframes,
method_name = names(list_of_dataframes),
SIMPLIFY = FALSE
)
# Step 5: For each method, extract exclusive protein metadata
exclusive_protein_info <- lapply(names(list_of_dataframes), function(method_name) {
df <- list_of_dataframes[[method_name]]
df_exclusive <- df[df$ID %in% exclusive_ids, ]
df_selected <- df_exclusive[, c("ID", "Protein.Group", "Protein.Ids", "Protein.Names", "Genes")] %>% distinct()
return(df_selected)
})
# Name the list for sheet names
names(exclusive_protein_info) <- names(list_of_dataframes)
# Step 6: Save to Excel
writexl::write_xlsx(exclusive_protein_info, path = file.path(final_out_dir, "Exclusive_Proteins_Per_Method.xlsx"))
# Set up enrichment analysis parameters
keytype <- "SYMBOL" # Assuming gene symbols are used
universe <- keys(org.Rn.eg.db, keytype)
# Function to perform GO analysis
run_go_enrichment <- function(genes) {
# Create an enriched GO object
ego <- enrichGO(gene = genes,
universe = universe,
OrgDb = org.Rn.eg.db,
keyType = keytype,
ont = "BP", # Biological Process
pAdjustMethod = "BH",
qvalueCutoff = 1)
return(ego)
}
# Apply GO enrichment to each gene vector
results_go_enrichment <- lapply(list_of_gene_vectors, run_go_enrichment)
# Apply GO enrichment to each gene vector
#random_results_go_enrichment <- lapply(random_list_of_gene_vectors, run_go_enrichment)
# Function to filter existing enrichment results
filter_go_results <- function(enrich_result) {
# Check if the object has results and filter
df <- data.frame(enrich_result@result)
if (nrow(df) > 0) { # Ensure there are results to filter
filtered_results <- df[df$p.adjust < 0.05, ]
# filtered_results <- df[df$pvalue < 0.05, ]
return(filtered_results)
} else {
return(data.frame()) # Return an empty data.frame if no results
}
}
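`enrichGO()` adjusts p-values with Benjamini-Hochberg (`pAdjustMethod = "BH"`); the correction behind the 0.05 cutoff used above can be reproduced with base R's `p.adjust` on toy p-values:

```r
pvals <- c(0.001, 0.01, 0.03, 0.2)       # toy raw p-values, sorted ascending
p_adj <- p.adjust(pvals, method = "BH")  # 0.004 0.020 0.040 0.200
# Keep only terms passing the FDR threshold, as in filter_go_results()
pvals[p_adj < 0.05]                      # 0.001 0.010 0.030
```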
# Apply the filtering function to each set of enrichment results
filtered_results_go_enrichment <- lapply(results_go_enrichment, filter_go_results)
# Convert to list of data frames with names
all_enrich_named <- setNames(
lapply(results_go_enrichment, function(x) as.data.frame(x@result)),
names(results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(all_enrich_named, path = file.path(final_out_dir, "BP_GO_All_Enrichment_Results.xlsx"))
filtered_enrich_named <- setNames(
filtered_results_go_enrichment,
names(filtered_results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(filtered_enrich_named, path = file.path(final_out_dir, "BP_GO_Filtered_Enrichment_Results.xlsx"))
# Function to extract GO term IDs from results
extract_go_ids <- function(df) {
if (nrow(df) > 0) { # Check if dataframe is not empty
return(df$ID)
} else {
return(character(0)) # Return empty character vector if no GO terms
}
}
# Extract GO term IDs from each filtered result
list_of_go_ids <- lapply(filtered_results_go_enrichment, extract_go_ids)
# Assuming list_of_go_ids is available and named per method or condition
names(list_of_go_ids) <- names(filtered_results_go_enrichment) # Ensure names are properly assigned
# Combine all GO term IDs into a single dataframe
all_go_ids <- unique(unlist(list_of_go_ids))
binary_matrix <- sapply(list_of_go_ids, function(ids) as.integer(all_go_ids %in% ids))
# Convert to dataframe for UpSetR
binary_df <- data.frame(binary_matrix, row.names = all_go_ids)
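The `sapply(... %in% ...)` construction generalizes to any named list of ID vectors; on toy GO IDs (hypothetical) it yields the binary membership matrix that ComplexUpset expects:

```r
toy_ids <- list(
  m1 = c("GO:0001", "GO:0002"),
  m2 = c("GO:0002", "GO:0003")
)
all_ids <- unique(unlist(toy_ids))  # "GO:0001" "GO:0002" "GO:0003"
m <- sapply(toy_ids, function(ids) as.integer(all_ids %in% ids))
rownames(m) <- all_ids
m  # 3 x 2 binary matrix; row "GO:0002" is 1 in both columns
```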
# Reverse the column order for display; reordering (rather than renaming) keeps labels matched to the data
binary_df <- binary_df[, rev(names(binary_df)), drop = FALSE]
upset_plot <-ComplexUpset::upset(data=binary_df,
intersect=names(binary_df), name="Extraction Methods",
wrap=TRUE,
base_annotations=list(
'Intersection size'=intersection_size(
text=list(
size = 2.5
)
)
),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust=1.1, stat='count')
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
+ expand_limits(y=7000)
),
sort_sets=FALSE,
queries=list(
upset_query(intersect = c("1step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("1step+C"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step+M"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("4step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size'))
),
sort_intersections_by='cardinality') + ggtitle('Intersection of Biological Processes (BP) GO Terms in Extraction Methods')
# Save the plot
ggsave(
filename = file.path(final_out_dir, "Comparison_of_BP_GO_Terms_per_Method.png"),
plot = upset_plot,
width = 10,
height = 6,
bg = 'white'
)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Print the plot to the console
print(upset_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
all_ids <- lapply(filtered_results_go_enrichment, function(df) df$ID)
top10_exclusive_results <- list()
for (name in names(filtered_results_go_enrichment)) {
this_df <- filtered_results_go_enrichment[[name]]
this_ids <- this_df$ID
other_ids <- unique(unlist(all_ids[names(all_ids) != name]))
exclusive_df <- this_df[this_df$ID %in% setdiff(this_ids, other_ids), ]
top10 <- exclusive_df %>% arrange(p.adjust) %>% head(10)
top10_exclusive_results[[name]] <- top10
}
writexl::write_xlsx(top10_exclusive_results, path = file.path(final_out_dir, "BP_GO_Filtered_Top10_Exclusive_Results.xlsx"))
# Set up enrichment analysis parameters
keytype <- "SYMBOL" # Assuming gene symbols are used
universe <- keys(org.Rn.eg.db, keytype)
# Function to perform GO analysis
run_go_enrichment <- function(genes) {
# Create an enriched GO object
ego <- enrichGO(gene = genes,
universe = universe,
OrgDb = org.Rn.eg.db,
keyType = keytype,
ont = "CC", # Cellular Component
pAdjustMethod = "BH",
qvalueCutoff = 1)
return(ego)
}
# Apply GO enrichment to each gene vector
results_go_enrichment <- lapply(list_of_gene_vectors, run_go_enrichment)
# Apply GO enrichment to each gene vector
#random_results_go_enrichment <- lapply(random_list_of_gene_vectors, run_go_enrichment)
# Function to filter existing enrichment results
filter_go_results <- function(enrich_result) {
# Check if the object has results and filter
df <- data.frame(enrich_result@result)
if (nrow(df) > 0) { # Ensure there are results to filter
filtered_results <- df[df$p.adjust < 0.05, ]
# filtered_results <- df[df$pvalue < 0.05, ]
return(filtered_results)
} else {
return(data.frame()) # Return an empty data.frame if no results
}
}
# Apply the filtering function to each set of enrichment results
filtered_results_go_enrichment <- lapply(results_go_enrichment, filter_go_results)
# Convert to list of data frames with names
all_enrich_named <- setNames(
lapply(results_go_enrichment, function(x) as.data.frame(x@result)),
names(results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(all_enrich_named, path = file.path(final_out_dir, "CC_GO_All_Enrichment_Results.xlsx"))
filtered_enrich_named <- setNames(
filtered_results_go_enrichment,
names(filtered_results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(filtered_enrich_named, path = file.path(final_out_dir, "CC_GO_Filtered_Enrichment_Results.xlsx"))
# Function to extract GO term IDs from results
extract_go_ids <- function(df) {
if (nrow(df) > 0) { # Check if dataframe is not empty
return(df$ID)
} else {
return(character(0)) # Return empty character vector if no GO terms
}
}
# Extract GO term IDs from each filtered result
list_of_go_ids <- lapply(filtered_results_go_enrichment, extract_go_ids)
# Assuming list_of_go_ids is available and named per method or condition
names(list_of_go_ids) <- names(filtered_results_go_enrichment) # Ensure names are properly assigned
# Combine all GO term IDs into a single dataframe
all_go_ids <- unique(unlist(list_of_go_ids))
binary_matrix <- sapply(list_of_go_ids, function(ids) as.integer(all_go_ids %in% ids))
# Convert to dataframe for UpSetR
binary_df <- data.frame(binary_matrix, row.names = all_go_ids)
# Reverse the column order for display; reordering (rather than renaming) keeps labels matched to the data
binary_df <- binary_df[, rev(names(binary_df)), drop = FALSE]
upset_plot <-ComplexUpset::upset(data=binary_df,
intersect=names(binary_df), name="Extraction Methods",
wrap=TRUE,
base_annotations=list(
'Intersection size'=intersection_size(
text=list(
size = 2.5
)
)
),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust=1.1, stat='count')
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
+ expand_limits(y=7000)
),
sort_sets=FALSE,
queries=list(
upset_query(intersect = c("1step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("1step+C"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step+M"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("4step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size'))
),
sort_intersections_by='cardinality') + ggtitle('Intersection of Cellular Component (CC) GO Terms in Extraction Methods')
# Save the plot
ggsave(
filename = file.path(final_out_dir, "Comparison_of_CC_GO_Terms_per_Method.png"),
plot = upset_plot,
width = 10,
height = 6,
bg = 'white'
)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Print the plot to the console
print(upset_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
all_ids <- lapply(filtered_results_go_enrichment, function(df) df$ID)
top10_exclusive_results <- list()
for (name in names(filtered_results_go_enrichment)) {
this_df <- filtered_results_go_enrichment[[name]]
this_ids <- this_df$ID
other_ids <- unique(unlist(all_ids[names(all_ids) != name]))
exclusive_df <- this_df[this_df$ID %in% setdiff(this_ids, other_ids), ]
top10 <- exclusive_df %>% arrange(p.adjust) %>% head(10)
top10_exclusive_results[[name]] <- top10
}
writexl::write_xlsx(top10_exclusive_results, path = file.path(final_out_dir, "CC_GO_Filtered_Top10_Exclusive_Results.xlsx"))
# Set up enrichment analysis parameters
keytype <- "SYMBOL" # Assuming gene symbols are used
universe <- keys(org.Rn.eg.db, keytype)
# Function to perform GO analysis
run_go_enrichment <- function(genes) {
# Create an enriched GO object
ego <- enrichGO(gene = genes,
universe = universe,
OrgDb = org.Rn.eg.db,
keyType = keytype,
ont = "MF", # Molecular Function
pAdjustMethod = "BH",
qvalueCutoff = 1)
return(ego)
}
# Apply GO enrichment to each gene vector
results_go_enrichment <- lapply(list_of_gene_vectors, run_go_enrichment)
# Apply GO enrichment to each gene vector
#random_results_go_enrichment <- lapply(random_list_of_gene_vectors, run_go_enrichment)
# Function to filter existing enrichment results
filter_go_results <- function(enrich_result) {
# Check if the object has results and filter
df <- data.frame(enrich_result@result)
if (nrow(df) > 0) { # Ensure there are results to filter
filtered_results <- df[df$p.adjust < 0.05, ]
# filtered_results <- df[df$pvalue < 0.05, ]
return(filtered_results)
} else {
return(data.frame()) # Return an empty data.frame if no results
}
}
# Apply the filtering function to each set of enrichment results
filtered_results_go_enrichment <- lapply(results_go_enrichment, filter_go_results)
# Convert to list of data frames with names
all_enrich_named <- setNames(
lapply(results_go_enrichment, function(x) as.data.frame(x@result)),
names(results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(all_enrich_named, path = file.path(final_out_dir, "MF_GO_All_Enrichment_Results.xlsx"))
filtered_enrich_named <- setNames(
filtered_results_go_enrichment,
names(filtered_results_go_enrichment)
)
# Save to Excel
writexl::write_xlsx(filtered_enrich_named, path = file.path(final_out_dir, "MF_GO_Filtered_Enrichment_Results.xlsx"))
# Function to extract GO term IDs from results
extract_go_ids <- function(df) {
if (nrow(df) > 0) { # Check if dataframe is not empty
return(df$ID)
} else {
return(character(0)) # Return empty character vector if no GO terms
}
}
# Extract GO term IDs from each filtered result
list_of_go_ids <- lapply(filtered_results_go_enrichment, extract_go_ids)
# Assuming list_of_go_ids is available and named per method or condition
names(list_of_go_ids) <- names(filtered_results_go_enrichment) # Ensure names are properly assigned
# Combine all GO term IDs into a single dataframe
all_go_ids <- unique(unlist(list_of_go_ids))
binary_matrix <- sapply(list_of_go_ids, function(ids) as.integer(all_go_ids %in% ids))
# Convert to dataframe for UpSetR
binary_df <- data.frame(binary_matrix, row.names = all_go_ids)
# Reverse the column order for display; reordering (rather than renaming) keeps labels matched to the data
binary_df <- binary_df[, rev(names(binary_df)), drop = FALSE]
upset_plot <-ComplexUpset::upset(data=binary_df,
intersect=names(binary_df), name="Extraction Methods",
wrap=TRUE,
base_annotations=list(
'Intersection size'=intersection_size(
text=list(
size = 2.5
)
)
),
set_sizes=(
upset_set_size()
+ geom_text(aes(label = after_stat(count)), hjust=1.1, stat='count')
+ annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
+ expand_limits(y=7000)
),
sort_sets=FALSE,
queries=list(
upset_query(intersect = c("1step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("1step+C"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("2step+M"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size')),
upset_query(intersect = c("4step"), color = "orchid", fill='orchid', only_components=c('intersections_matrix', 'Intersection size'))
),
sort_intersections_by='cardinality') + ggtitle('Intersection of Molecular Function (MF) GO Terms in Extraction Methods')
# Save the plot
ggsave(
filename = file.path(final_out_dir, "Comparison_of_MF_GO_Terms_per_Method.png"),
plot = upset_plot,
width = 10,
height = 6,
bg = 'white'
)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
# Print the plot to the console
print(upset_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
all_ids <- lapply(filtered_results_go_enrichment, function(df) df$ID)
top10_exclusive_results <- list()
for (name in names(filtered_results_go_enrichment)) {
this_df <- filtered_results_go_enrichment[[name]]
this_ids <- this_df$ID
other_ids <- unique(unlist(all_ids[names(all_ids) != name]))
exclusive_df <- this_df[this_df$ID %in% setdiff(this_ids, other_ids), ]
top10 <- exclusive_df %>% arrange(p.adjust) %>% head(10)
top10_exclusive_results[[name]] <- top10
}
writexl::write_xlsx(top10_exclusive_results, path = file.path(final_out_dir, "MF_GO_Filtered_Top10_Exclusive_Results.xlsx"))